Objective
To create an open-access digital R reference book that is tailored to epidemiologists and public health practitioners, is usable offline, and addresses common epidemiological tasks through clear text explanations, step-by-step instructions, and best-practice R code examples.
The problem:
Most online R help resources are neither task-centered nor epidemiology-focused. Epidemiologists who are learning or new to R often must Google and skim dozens of forum pages to complete common epi data manipulation and visualization tasks. Furthermore, field epidemiologists often work in low internet-connectivity environments and have limited technical support.
How to read this handbook:
This handbook is an HTML file. It is not online; you are only using your web browser to view this local file.
This handbook is best viewed with Google Chrome. Some functions may not work in other browsers.
Use the tabs on the right to hide/view code. See the ‘Copy to clipboard’ icon in the upper-right of each code section.
Version
The latest version of this handbook can be found at this GitHub repository.
Style
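Functions are written with parentheses, like mutate(). They are often written with their package name as well (e.g. dplyr::mutate()), so it is clear to the reader which package is being used.
Types of notes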
FOR EXAMPLE: This is a boxed example
NOTE: This is a note
TIP: This is a tip.
CAUTION: This is a cautionary note.
DANGER: This is a warning.
Here the datasets used in this handbook will be described and will be “downloadable” via link (the files will be stored within the HTML, so available offline as well)
Editor-in-Chief: Neale Batra (neale.batra@gmail.com)
Editorial core team: …
Authors: …
Reviewers: …
Advisors …
Data contributors:
outbreaks package
Some of this material comes from the R4Epis website, which was also made by some of the same people…
RECON packages
Photo credits (logo): CDC Public Image gallery; R Graph Gallery
This section is not meant as a comprehensive “how to learn R” tutorial. However, it does cover some of the fundamentals that can be good to reference or refresh.
More comprehensive tutorials are available online:
* Here
* and Here
* and even Here
* Oh yea and Here too (there’s a lot of them)
How to install R
How to install RStudio
Other things you may need to install:
* TinyTeX
* Pandoc
* RTools
First, open RStudio. As their icons can look very similar, be sure you are opening RStudio and not R.
For RStudio to function you must also have R installed on the computer (see this section for installation instructions).
RStudio is an interface (GUI) for easier use of R. You can think of R as being the engine of a vehicle, doing the crucial work, and RStudio as the body of the vehicle (with seats, accessories, etc.) that helps you actually use the engine to move forward!
By default RStudio displays four rectangular panes.
TIP: If your RStudio displays only one left pane it is because you have no scripts open yet.
The R Console Pane
The R Console, by default the left or lower-left pane in RStudio, is the home of the R “engine”. This is where the commands are actually run and where non-graphic outputs and error/warning messages appear. You can directly enter and run commands in the R Console, but be aware that these commands are not saved, as they are when you run commands from a script.
If you are familiar with Stata, the R Console is like the Command Window and also the Results Window.
The Source Pane
This pane, by default in the upper-left, is space to edit and run your scripts. This pane can also display datasets (data frames) for viewing.
For Stata users, this pane is similar to your Do-file and Data Editor windows.
The Environment Pane
This pane, by default the upper-right, is most often used to see brief summaries of objects in the R Environment in the current session. These objects could include imported, modified, or created datasets, parameters you have defined (e.g. a specific epi week for the analysis), or vectors or lists you have defined during analysis (e.g. names of regions). Click on the arrow next to a dataframe name to see its variables.
In Stata, this is most similar to the Variables Manager window.
Plots, Packages, and Help Pane
The lower-right pane includes several tabs including plots (display of graphics including maps), help, a file library, and available R packages (including installation/update options).
This pane contains the Stata equivalents of the Plots Manager and Project Manager windows.
Change RStudio settings and appearance in the Tools drop-down menu, by selecting Global Options
R scripts (vs. typing in the console)
* Advantages (reproducibility)
* General sequence (intro, load packages, load data, clean data, conduct analysis, save results)
* Commenting
These tabs cover how to use R working directories, and how this changes when you are working within an R project. The working directory is the root file location used by R for your work.
By default, it will save new files and outputs to this location, and will look for files to import (e.g. datasets) here as well.
NOTE: If using an [R project](#rproject), the working directory will default to the R project root folder **IF** you open RStudio by opening the R project (the file with the .Rproj extension).
Use the command setwd() with the filepath in quotations, for example: setwd("C:/Documents/R Files")
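As a minimal sketch of checking and changing the working directory, the example below uses only base R. The folder tempdir() and the file name test_file.txt are stand-ins chosen so the example runs on any computer; in practice you would supply your own folder path in quotations.

```r
# Print the current working directory and remember it
old_wd <- getwd()
old_wd

# Change the working directory (tempdir() is only an illustrative folder)
setwd(tempdir())
getwd()

# A file saved with a bare file name now lands in the new working directory
writeLines("example", "test_file.txt")
file.exists(file.path(tempdir(), "test_file.txt"))

# Return to the original working directory
setwd(old_wd)
```

Saving the original location first and restoring it at the end avoids leaving R pointed at an unexpected folder.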
CAUTION: If using an RMarkdown script be aware of the following:
In an R Markdown script, the default working directory is the folder where the R Markdown file (.Rmd) is saved. If you want to change this, you can use setwd() as above, but know that the change will only apply to that specific code chunk.
To change the working directory for all code chunks in an R Markdown document, edit the setup chunk to add the root.dir = parameter, such as below:
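The setup-chunk change could look like the following minimal sketch. It assumes the knitr package (which renders R Markdown documents) is installed, and the folder path is only an illustration re-used from the setwd() example.

```r
# Place this inside the setup chunk of the R Markdown document.
# The folder path is a hypothetical example; substitute your own.
knitr::opts_knit$set(root.dir = "C:/Documents/R Files")
```

The new root directory takes effect in the chunks that follow the setup chunk.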
Setting your working directory manually (point-and-click)
From RStudio click: Session / Set Working Directory / Choose Directory (you will have to do this each time you open RStudio)
How things change in an R project
Everything in R is an object. These sections will explain:
Everything you store in R - datasets, variables, a list of village names, a total population number, even outputs such as graphs - is an object that is assigned a name and can be referenced in later commands.
An object exists when you have assigned it a value (see the assignment section below). When it is assigned a value, the object appears in the Environment (see the upper right pane of RStudio). It can then be operated upon, manipulated, changed, and re-defined.
Create objects by assigning them a value with the <- operator.
You can think of the assignment operator <- as the words “is defined as”. Assignment commands generally follow a standard order:
object_name <- value (or process/calculation that produces a value)
EXAMPLE: You may want to record the current epidemiological reporting week as an object for reference in later code. In this example, the object reporting_week is created when it is assigned the character value "2018-W10" (the quote marks make this a character value).
The object reporting_week will then appear in the RStudio Environment pane (upper-right) and can be referenced in later commands.
See the R commands and their output in the boxes below.
reporting_week <- "2018-W10" # this command creates the object reporting_week by assigning it a value
reporting_week # this command prints the current value of reporting_week object in the console
## [1] "2018-W10"
NOTE: The [1] in the R console output simply indicates that you are viewing the first item of the output.
CAUTION: An object’s value can be over-written at any time by running an assignment command to re-define its value. Thus, the order of the commands run is very important.
The following command will re-define the value of reporting_week:
reporting_week <- "2018-W51" # assigns a NEW value to the object reporting_week
reporting_week # prints the current value of reporting_week in the console
## [1] "2018-W51"
Datasets are also objects and must be assigned names when they are imported.
In the code below, the object linelist is created and assigned the value of a CSV file imported with the rio package.
# linelist is created and assigned the value of the imported CSV file
linelist <- rio::import("my_linelist.csv")
You can read more about importing and exporting datasets in the section on importing data.
CAUTION: A quick note on naming of objects:
Objects can be a single piece of data (e.g. my_number <- 24), or they can consist of structured data.
The graphic below, sourced from this online R tutorial shows some common data structures and their names. Not included in this image is spatial data, which is discussed in the GIS section.
In epidemiology (and particularly field epidemiology), you will most commonly encounter data frames and vectors:
| Common structure | Explanation | Example from templates |
|---|---|---|
| Vectors | A container for a sequence of singular objects, all of the same class (e.g. numeric, character). | “Variables” (columns) in data frames are vectors (e.g. the variable age_years). |
| Data Frames | Vectors (e.g. columns) that are bound together that all have the same number of rows. | linelist_raw and linelist_cleaned are both data frames. |
Note that to create a vector that “stands alone”, or is not part of a data frame (such as a list of location names), the function c() is often used:
list_of_names <- c("Ruhengeri", "Gisenyi", "Kigali", "Butare")
All the objects stored in R have a class which tells R how to handle the object. There are many possible classes, but common ones include:
| Class | Explanation | Examples |
|---|---|---|
| Character | These are text/words/sentences “within quotation marks”. Math cannot be done on these objects. | “Character objects are in quotation marks” |
| Numeric | These are numbers and can include decimals. If within quotation marks they will be considered character. | 23.1 or 14 |
| Integer | Numbers that are whole only (no decimals) | -5, 14, or 2000 |
| Factor | These are vectors that have a specified order or hierarchy of values | Variable msf_involvement with ordered values N, S, SUB, and U. |
| Date | Once R is told that certain data are Dates, these data can be manipulated and displayed in special ways. See the page on Dates for more information. | 2018-04-12 or 15/3/1954 or Wed 4 Jan 1980 |
| Logical | Values must be one of the two special values TRUE or FALSE (note these are not “TRUE” and “FALSE” in quotation marks) | TRUE or FALSE |
| data.frame | A data frame is how R stores a typical dataset. It consists of vectors (columns) of data bound together, that all have the same number of observations (rows). | The example AJS dataset named linelist_raw contains 68 variables with 300 observations (rows) each. |
You can test the class of an object by feeding it to the function class(). Note: you can reference a specific column within a dataset using the $ notation to separate the name of the dataset and the name of the column.
class(linelist$age) # class should be numeric
## [1] "numeric"
class(linelist$gender) # class should be character
## [1] "character"
Often, you will need to convert objects or variables to another class.
| Function | Action |
|---|---|
| as.character() | Converts to character class |
| as.numeric() | Converts to numeric class |
| as.integer() | Converts to integer class |
| as.Date() | Converts to Date class - Note: see section on dates for details |
| as.factor() | Converts to factor - Note: re-defining order of value levels requires extra arguments |
Here is more online material on classes and data structures in R.
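As a small illustration of checking and converting classes, the sketch below uses made-up values (the object names age_text and outcome are hypothetical, not columns from the handbook datasets) and only base R functions.

```r
# Numbers stored inside quotation marks are character class
age_text <- c("23", "45", "61")
class(age_text)                 # "character"

# Convert to numeric so that math can be done
age_num <- as.numeric(age_text)
class(age_num)                  # "numeric"
mean(age_num)                   # 43

# Convert to integer (whole numbers only)
age_int <- as.integer(age_num)
class(age_int)                  # "integer"

# Convert a character vector to a factor and view its levels
outcome <- as.factor(c("Death", "Recover", "Recover"))
class(outcome)                  # "factor"
levels(outcome)                 # "Death" "Recover"
```

By default as.factor() orders the levels alphabetically; a different order requires extra arguments, as noted in the table above.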
Vectors within a data frame (variables in a dataset) can be called, referenced, or created using the $ symbol. The $ symbol connects the name of the column to the name of its data frame. The $ symbol must be used, otherwise R will not know where to look for or create the column.
# Retrieve the length of the vector age
length(linelist$age) # (age is a variable in the linelist data frame)
By typing the name of the data frame followed by $ you will also see a list of all variables in the data frame. You can scroll through them using your arrow keys, select one with your Enter key, and avoid spelling mistakes!
ADVANCED TIP: Some more complex objects (e.g. an epicontacts object) may have multiple levels which can be accessed through multiple dollar signs, for example epicontacts$linelist$date_onset.
You may need to view parts of objects, which is often done using the square brackets [ ].
To view specific rows and columns of a dataset, you can do this using the syntax dataframe[rows, columns]:
# View a specific row (2) from dataset, with all columns
linelist[2,]
# View all rows, but just one column
linelist[, "date_onset"]
# View values from row 2 and columns 5 through 10
linelist[2, 5:10]
# View values from row 2 and columns 5 through 10 and 18
linelist[2, c(5:10, 18)]
# View rows 2 through 20, and specific columns
linelist[2:20, c("date_onset", "outcome", "age")]
# View rows and columns based on criteria
# *** Note the dataframe must still be named in the criteria!
linelist[linelist$age > 25 , c("date_onset", "date_birth", "age")]
# Use View() to see the outputs in the RStudio Viewer pane (easier to read)
# *** Note the capital "V" in View() function
View(linelist[2:20, "date_onset"])
# Save as a new object
new_table <- linelist[2:20, c("date_onset")]
The square brackets also work to call specific parts of the output of the summary() function:
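As a small illustration of using square brackets on summary() output, the sketch below uses a made-up vector of ages (not the handbook linelist) and only base R.

```r
# Hypothetical ages, including one missing value
age <- c(2, 47, 60, 72, 91, NA)

# The full summary is a named set of values
age_summary <- summary(age)
age_summary

# Extract one element by position (this keeps its name)
age_summary[3]

# Extract one element by name, as a plain number
age_summary[["Median"]]          # 60

# Extract several elements by name
age_summary[c("Min.", "Max.")]
```

Referring to elements by name is usually safer than by position, because the position can shift (for example, the NA's element only appears when values are missing).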
This section on functions explains:
* What a function is and how it works
* What arguments are
* What packages are
* How to get help understanding a function
A function is like a machine that receives inputs, does some action with those inputs, and produces an output.
What the output is depends on the function.
Functions typically operate upon some object placed within the function’s parentheses. For example, the function sqrt() calculates the square root of a number:
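A minimal example of sqrt() in use is sketched below; the object name root is only illustrative.

```r
# Place the input inside the parentheses of the function
sqrt(49)        # 7

# The output can be saved as an object and used again
root <- sqrt(49)
root + 1        # 8
```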
Functions can also be applied to variables in a dataset. For example, when the function summary() is applied to the numeric variable age in the dataset linelist (what’s the $ symbol?), the output is a summary of the variable’s numeric and missing values.
summary(linelist$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.00 47.25 60.00 56.91 72.00 91.00 2
NOTE: Behind the scenes, a function represents complex additional code that has been wrapped up for the user into one easy command.
Functions often ask for several inputs, called arguments, located within the parentheses of the function, usually separated by commas.
For example, this age_pyramid() command produces an age pyramid graphic based on defined age groups and a binary split variable, such as gender. The function is given three arguments within the parentheses, separated by commas. The values supplied to the arguments establish linelist as the data frame to use, age_group as the variable to count, and gender as the binary variable to use for splitting the pyramid by color.
NOTE: For this example, in the background we have created a new variable called “age_group”. To learn how to create a new variable, see that section of this handbook.
# Creates an age pyramid by specifying the dataframe, age group variable, and a variable to split the pyramid
apyramid::age_pyramid(data = linelist, age_group = "age_group", split_by = "gender")
The first half of an argument assignment (e.g. data =) does not need to be specified if the arguments are written in a specific order (specified in the function’s documentation). The code below produces the exact same pyramid as above, because the function expects the argument order: data frame, age_group variable, split_by variable.
# This command will produce the exact same graphic as above
apyramid::age_pyramid(linelist, "age_group", "gender")
A more complex age_pyramid() command might include the optional arguments to:
* Show proportions instead of counts (set proportional = TRUE when the default is FALSE)
* Specify the two colors to use (pal = is short for “palette” and is supplied with a vector of two color names. See the objects page for how the function c() makes a vector)
NOTE: For arguments specified with an equals symbol (e.g. coltotals = ...), their order among the arguments is not important (they must still be within the parentheses and separated by commas).
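The same rules about named and unnamed arguments can be tried with a base R function, so that no extra package is needed. The sketch below uses round(), whose arguments are x (the number) and digits (the number of decimal places); the object names are only illustrative.

```r
# Arguments named explicitly
named <- round(x = 3.14159, digits = 2)

# Names omitted: the arguments must be in the order the function expects
by_position <- round(3.14159, 2)

# Names given: the order among the arguments no longer matters
reordered <- round(digits = 2, x = 3.14159)

named          # 3.14
by_position    # 3.14
reordered      # 3.14
```

Naming the arguments takes more typing, but it makes the code easier to read and protects against mistakes in the order.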
Packages contain functions.
On installation, R contains “base” functions that perform common elementary tasks. But many R users create specialized functions, which are verified by the R community and which you can download as a package for your own use.
One of the more challenging aspects of R is that there are often many functions or packages to choose from to complete a given task.
Functions are contained within packages which can be downloaded (“installed”) to your computer from the internet. Once a package is downloaded, you access its functions by loading the package with the library() command at the beginning of each R session.
NOTE: While you only have to install a package once, you must load it at the beginning of every R session using library() command, or an alternative like pacman’s p_load() function.
Think of R as your personal library: When you download a package your library gains a book of functions, but each time you want to use a function in that book, you must borrow that book from your library.
For clarity in this handbook, functions are usually preceded by the name of their package using the :: symbol in the following way:
package_name::function_name()
Once a package is loaded for a session, this explicit style is not necessary. One can just use function_name(). However giving the package name is useful when a function name is common and may exist in multiple packages (e.g. plot()).
Writing the package name with :: also lets you use the function without first loading the package with library(), as long as the package is installed.
# This command uses the package "rio" and its function "import()" to import a dataset
linelist <- rio::import("linelist.xlsx", which = "Sheet1")
Dependencies
Packages often depend on other packages, and these are called “dependencies”. When a package is installed from CRAN, it will typically also install its dependencies.
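The install-once, load-every-session pattern and the package_name::function_name() style described in this section can be sketched as follows. The stats package is used for the runnable lines only because it is installed together with R; the install.packages() line is left commented out because it needs internet access.

```r
# Installing is done once per computer (requires internet access)
# install.packages("rio")

# Loading is done at the beginning of every R session
library(stats)
median(c(3, 9, 4))          # 4

# The explicit package_name::function_name() style gives the same result
stats::median(c(3, 9, 4))   # 4

# Check whether a package is installed (returns TRUE or FALSE)
requireNamespace("stats", quietly = TRUE)
```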
To read more about a function, you can try searching online for resources OR search in the Help tab of the lower-right RStudio pane.
Two general approaches to R coding are piping and defining intermediate objects.
Simply explained, the pipe operator (%>%) passes an intermediate output from one function to the next.
You can think of it as saying “then”. Many functions can be linked together with %>%.
* Piping emphasizes a sequence of actions, not the object the actions are being performed on.
* It is best when a sequence of actions must be performed on one object.
* The pipe operator comes from the magrittr package and is included in dplyr and the tidyverse.
* It makes code cleaner, more intuitive, and easier to read.
* The object is altered and then passed on to the next function.
Example:
# A fake example of how to bake a cake using piping syntax
cake <- flour %>% # to define cake, start with flour, and then...
  left_join(eggs) %>% # add eggs
  left_join(oil) %>% # add oil
  left_join(water) %>% # add water
  mix_together(utensil = spoon, minutes = 2) %>% # mix together
  bake(degrees = 350, system = "fahrenheit", minutes = 35) %>% # bake
  let_cool() # let it cool down
https://cfss.uchicago.edu/notes/pipes/
Piping is not a base function. To use piping, the dplyr package must be installed and loaded. Near the top of every template script is a code chunk that installs and loads the necessary packages, including dplyr. You can read more about piping in the documentation.
CAUTION: Remember that even when using piping to link functions, if the assignment operator (<-) is present, the object to the left will still be over-written (re-defined) by the right side.
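Because the cake functions above are fictional, here is a small piping example that can actually be run. It is a sketch that assumes the magrittr package is installed (it is installed along with dplyr), and the ages are made-up values.

```r
library(magrittr)   # provides the %>% operator

# Hypothetical ages with one missing value
ages <- c(34, 5, NA, 61, 18)

# Start with ages, then drop missing values, then take the mean, then round
mean_age <- ages %>%
  na.omit() %>%
  mean() %>%
  round(digits = 1)

mean_age            # 29.5

# The same calculation written as nested functions is harder to read
mean_age_nested <- round(mean(na.omit(ages)), digits = 1)
```

Note that ages itself is unchanged here, because the assignment operator re-defines only the object on its left (mean_age).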
TODO %<>% shortcut for re-defining the object and piping
The other approach saves the result of each step as its own object as changes are made. It is still handy to know, and is better if:
* You need to manipulate multiple objects
* There are intermediate steps that are meaningful and deserve separate object names
Risks: creating a new object for each step produces many objects. If you use the wrong one you might not notice, naming can become confusing, and errors are not easily detected.
You can name each intermediate object, overwrite the original object, or combine all the functions together. All of these options come with risks.
https://style.tidyverse.org/pipes.html
# a fake example of how to bake a cake using this method (defining intermediate objects)
batter_1 <- left_join(flour, eggs)
batter_2 <- left_join(batter_1, oil)
batter_3 <- left_join(batter_2, water)
batter_4 <- mix_together(object = batter_3, utensil = spoon, minutes = 2)
cake <- bake(batter_4, degrees = 350, system = "fahrenheit", minutes = 35)
cake <- let_cool(cake)
Combining all the functions together is also possible, but difficult to read:
# an example of combining/nesting multiple functions together - difficult to read
cake <- let_cool(bake(mix_together(batter_3, utensil = spoon, minutes = 2), degrees = 350, system = "fahrenheit", minutes = 35))
This section details operators in R, such as:
* Relational operators (less than, equal to…)
* Logical operators (and, or…)
* Handling missing values
* Mathematical operators and functions (+/-, >, sum(), median(), …)
* The %in% operator
Relational operators compare values and are often used when defining new variables and subsets of datasets. Here are the common relational operators in R:
| Function | Operator | Example | Example Result |
|---|---|---|---|
| Equal to | == | "A" == "a" | FALSE (because R is case sensitive). Note that == (double equals) is different from = (single equals), which acts like the assignment operator <- |
| Not equal to | != | 2 != 0 | TRUE |
| Greater than | > | 4 > 2 | TRUE |
| Less than | < | 4 < 2 | FALSE |
| Greater than or equal to | >= | 6 >= 4 | TRUE |
| Less than or equal to | <= | 6 <= 4 | FALSE |
| Value is missing | is.na() | is.na(7) | FALSE (see section on missing values) |
| Value is not missing | !is.na() | !is.na(7) | TRUE |
Logical operators, such as AND and OR, are often used to connect relational operators and create more complicated criteria. Complex statements might require parentheses ( ) for grouping and order of application.
| Function | Operator |
|---|---|
| AND | & |
| OR | \| (vertical bar) |
| Parentheses | ( ) Used to group criteria together and clarify order |
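The relational and logical operators in the tables above can be tried on small vectors. The sketch below uses made-up ages and sexes for four patients; each comparison is applied to every value in turn.

```r
# Hypothetical values for four patients
age <- c(12, 30, 45, 67)
sex <- c("f", "m", "f", "m")

# A relational operator returns TRUE or FALSE for each value
adult <- age >= 18                  # FALSE TRUE TRUE TRUE

# AND: both criteria must be met
adult_female <- age >= 18 & sex == "f"    # FALSE FALSE TRUE FALSE

# OR: at least one criterion must be met
child_or_male <- age < 18 | sex == "m"    # TRUE TRUE FALSE TRUE

# Parentheses group criteria and clarify the order of application
young_or_old_male <- (age < 18 | age > 65) & sex == "m"   # FALSE FALSE FALSE TRUE

# Logical results can be used inside square brackets to subset
age[adult_female]                   # 45
```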
For example, below, we have a linelist with two variables that we want to use to create our case definition: hep_e_rdt, a test result, and other_cases_in_hh, which tells us whether there are other cases in the household. The command below uses the function case_when() to create the new variable case_def such that:
linelist_cleaned <- linelist_cleaned %>%
  mutate(case_def = case_when(
    is.na(hep_e_rdt) & is.na(other_cases_in_hh) ~ NA_character_,
    hep_e_rdt == "Positive" ~ "Confirmed",
    hep_e_rdt != "Positive" & other_cases_in_hh == "Yes" ~ "Probable",
    TRUE ~ "Suspected"
  ))
| Criteria in example above | Resulting value in new variable “case_def” |
|---|---|
| If the values for variables hep_e_rdt and other_cases_in_hh are both missing | NA (missing) |
| If the value in hep_e_rdt is “Positive” | “Confirmed” |
| If the value in hep_e_rdt is NOT “Positive” AND the value in other_cases_in_hh is “Yes” | “Probable” |
| If none of the above criteria are met | “Suspected” |
TIP: Note that R is case-sensitive, so “Positive” is different from “positive”…
In R, missing values are represented by the special value NA (capital letters N and A - not in quotation marks). If you import data that records missing data in another way (e.g. 99, “Missing”, or .), you may want to re-code those values to NA.
To test whether a value is NA, use the special function is.na(), which returns TRUE or FALSE.
rdt_result <- c("Positive", "Suspected", "Positive", NA) # two positive cases, one suspected, and one unknown
is.na(rdt_result) # Tests whether the value of rdt_result is NA
## [1] FALSE FALSE FALSE TRUE
TO DO: SECTION ON OTHER NA TYPES: NA_character, NA_real etc. SECTION ON NULL
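Re-coding another missing-value code to NA can be sketched in base R as below. The vector age_raw and the use of 99 as the "missing" code are made-up illustrations.

```r
# Hypothetical ages where 99 was used to record "missing"
age_raw <- c(34, 99, 5, 99, 61)

# Copy the vector, then replace every 99 with NA
age <- age_raw
age[age == 99] <- NA
age                      # 34 NA 5 NA 61

# Test which values are missing, and count them
is.na(age)               # FALSE TRUE FALSE TRUE FALSE
sum(is.na(age))          # 2
```

Be sure the code you replace (here 99) cannot also be a genuine value in your data.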
All the operators and functions on this page are automatically available using base R.
These are often used to perform addition, division, to create new columns, etc. Below are common mathematical operators in R. Whether you put spaces around the operators is not important.
| Objective | Example in R |
|---|---|
| addition | 2 + 3 |
| subtraction | 2 - 3 |
| multiplication | 2 * 3 |
| division | 30 / 5 |
| exponent | 2^3 |
| order of operations | ( ) |
| Objective | Function |
|---|---|
| rounding | round(x, digits = n) |
| ceiling (round up) | ceiling(x) |
| floor (round down) | floor(x) |
| absolute value | abs(x) |
| square root | sqrt(x) |
| exponential (e to the power x) | exp(x) |
| natural logarithm | log(x) |
CAUTION: The functions below will by default include missing values in calculations. Missing values will result in an output of NA, unless the argument na.rm=TRUE is specified
| Objective | Function |
|---|---|
| mean (average) | mean(x, na.rm=T) |
| median | median(x, na.rm=T) |
| standard deviation | sd(x, na.rm=T) |
| quantiles* | quantile(x, probs) |
| sum | sum(x, na.rm=T) |
| minimum value | min(x, na.rm=T) |
| maximum value | max(x, na.rm=T) |
| range of numeric values | range(x, na.rm=T) |
| summary** | summary(x) |
*quantile(): x is the numeric vector to examine, and probs is a numeric vector with probabilities between 0 and 1.0, e.g. c(0.5, 0.8, 0.85)
**summary(): gives a summary on a numeric vector including mean, median, and common percentiles
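The effect of na.rm = TRUE on these statistical functions can be seen in the short sketch below, which uses a made-up vector of ages with one missing value.

```r
# Hypothetical ages with one missing value
ages <- c(25, 31, NA, 47, 52)

# By default the missing value makes the result NA
mean(ages)                           # NA

# With na.rm = TRUE the missing value is removed before calculating
mean_age   <- mean(ages, na.rm = TRUE)     # 38.75
median_age <- median(ages, na.rm = TRUE)   # 39
age_range  <- range(ages, na.rm = TRUE)    # 25 52

# quantile() also accepts na.rm, plus a vector of probabilities
quantile(ages, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# summary() reports the number of missing values separately
summary(ages)
```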
| Objective | Function | Example |
|---|---|---|
| create a sequence | seq(from, to, by) | seq(1, 10, 2) |
| repeat x, n times | rep(x, ntimes) | rep(1:3, 2) or rep(c("a", "b", "c"), 3) |
| subdivide a numeric vector | cut(x, n) | cut(linelist$age, 5) |
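The functions in the table above can be tried as follows. This is a minimal sketch; the ages passed to cut() are made-up values rather than the linelist column.

```r
# Create a sequence from 1 to 10 in steps of 2
odd_numbers <- seq(from = 1, to = 10, by = 2)   # 1 3 5 7 9

# Repeat a vector 2 times
repeated <- rep(1:3, 2)                         # 1 2 3 1 2 3

# Repeat a character vector 3 times
rep(c("a", "b", "c"), 3)

# Subdivide a numeric vector into 5 equal-width groups (returns a factor)
ages <- c(2, 15, 34, 47, 68, 91)
age_groups <- cut(ages, 5)
levels(age_groups)
table(age_groups)
```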
%in%
This section describes the several ways to install a package:
* Via the online package repository (CRAN)
* From a ZIP file
* From Github
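The three routes listed above might look like the sketch below. The installation commands sit behind a switch (run_examples, a name made up for this example) because they need internet access or a real file. The ZIP path and the GitHub repository name are hypothetical placeholders, and the GitHub route assumes the remotes package is installed.

```r
# Change to TRUE to actually run the installations
run_examples <- FALSE

if (run_examples) {
  # 1) From the online package repository (CRAN)
  install.packages("rio")

  # 2) From a ZIP file saved on your computer (hypothetical path)
  install.packages("C:/Downloads/some_package.zip", repos = NULL)

  # 3) From GitHub (hypothetical repository; requires the remotes package)
  remotes::install_github("some_owner/some_package")
}
```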
This section explains:
* General syntax for writing R code
* Code assists
* the difference between errors and warnings
Common errors and warnings and their solutions can be found in X section (TODO).
A few things to remember when writing commands in R, to avoid errors and warnings:
* R is case-sensitive: Variable_A is different from variable_A
Any script (RMarkdown or otherwise) will give clues when you have made a mistake. For example, if you forgot to write a comma where it is needed, or to close a parenthesis, RStudio will raise a flag on that line, on the right side of the script, to warn you.
When a command is run, the R Console may show you warning or error messages in red text.
A warning means that R has completed your command, but had to take additional steps or produced unusual output that you should be aware of.
An error means that R was not able to complete your command.
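The difference between a warning and an error can be seen in the small sketch below, which uses only base R. The object names are illustrative, and tryCatch() is used only so that the example keeps running after the error instead of stopping.

```r
# A WARNING: R completes the command, but "abc" cannot become a number,
# so it is replaced with NA and a warning message is shown
ages <- as.numeric(c("12", "abc"))
ages                                  # 12 NA

# An ERROR: R cannot complete the command at all (math on a character value)
total <- tryCatch(
  "12" + 1,
  error = function(e) {
    message("Error caught: ", conditionMessage(e))
    NA                                # value to use when the command fails
  }
)
total                                 # NA
```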
Look for clues:
The error/warning message will often include a line number for the problem.
If an object “is unknown” or “not found”, perhaps you spelled it incorrectly, forgot to call a package with library(), or forgot to re-run your script after making changes.
If all else fails, copy the error message into Google along with some key terms - chances are that someone else has worked through this already!
Introduction to importing data
The key package we recommend for importing data is: rio. rio offers the useful function import() which can import many types of files into R.
The alternative to using rio would be to use functions from several other packages that are specific to a type of file (e.g. read.csv(), read.xlsx(), etc.). While these alternatives can be difficult to remember, always using rio::import() is relatively easy.
Optionally, the package here can be used in conjunction with rio. It locates files on your computer via relative pathways, usually within the context of an R project. Relative pathways are relative from a designated folder location, so that pathways listed in R code will not break when the script is run on a different computer.
This code chunk shows the loading of packages for importing data.
import()
When you import a dataset, you are doing the following:
The function import() from the package rio makes it easy to import many types of data files.
# An example:
#############
library(rio) # ensure package rio is loaded for use
# New object is defined as the imported data
my_csv_data <- import("linelist.csv") # importing a csv file
my_Excel_data <- import("observations.xlsx", which = "February") # import an Excel file
import() uses the file’s extension (e.g. .xlsx, .csv, .dta, etc.) to appropriately import the file. Any optional arguments specific to the filetype can be supplied as well.
You can read more about the rio package in this online vignette
CAUTION: In the example above, the datasets are assumed to be located in the working directory, or the same folder as the script.
A filepath can be provided in full (as below) or as a relative filepath (see next tab). Providing a full filepath can be fast and may be best if referencing files from a shared/network drive.
The function import() (from the package rio) accepts a filepath in quotes. A few things to note:
If importing a specific sheet from an Excel file, include the sheet name in the which = argument of import(). For example:
# A demonstration showing how to import a specific Excel sheet
my_data <- rio::import("my_excel_file.xlsx", which = "Sheetname")
If using the here() method to provide a relative pathway to import(), you can still indicate a specific sheet by adding the which = argument after the closing parenthesis of the here() function.
You can import data manually via one of these methods:
* Use file.choose() (leaving the parentheses empty) to trigger the appearance of a pop-up window that allows the user to manually select the file from their computer. For example:
# A demonstration showing manual selection of a file. When this command is run, a POP-UP window should appear.
# The filepath of the selected file will be supplied to the import() command.
my_data <- rio::import(file.choose())
TIP: The pop-up window may appear BEHIND your RStudio window.
Relative filepaths differ from static filepaths in that they are relative to an R project root directory. For example:
import("C:/Users/nsbatra/My Documents/R files/epiproject/data/linelists/ebola_linelist.xlsx")
import(here("data", "linelists", "ebola_linelist.xlsx"))
The package here and its function here() facilitate relative pathways.
here() works best within R projects. When the here package is first loaded (library(here)), it automatically considers the top-level folder of your R project as “here” - a benchmark for all other files in the project.
Thus, in your script, if you want to import or reference a file saved in your R project’s folders, you use the function here() to tell R where the file is in relation to that benchmark.
If you are unsure where “here” is set to, run the function here() with the empty brackets:
Below is an example of importing the file “fluH7N9_China_2013.csv” which is located in the benchmark “here” folder. All you have to do is provide the name of the file in quotes (with the appropriate ending).
If the file is within a subfolder - let’s say a “data” folder - write these folder names in quotes, separated by commas, as below:
Using the here() command produces a character filepath, which can then be processed by the import() function.
# the filepath
here("data", "fluH7N9_China_2013.csv")
# the filepath is given to the import() function
linelist <- import(here("data", "fluH7N9_China_2013.csv"))
NOTE: You can still import a specific sheet of an Excel file as noted in the Excel tab. The here() command only supplies the filepath.
Since a data frame is a combination of vertical vectors (columns), R by default expects manual entry of data to also be in vertical vectors (columns).
# define each vector (vertical column) separately, each with its own name
PatientID <- c(235, 452, 778, 111)
Treatment <- c("Yes", "No", "Yes", "Yes")
Death <- c(1, 0, 1, 0)
CAUTION: All vectors must be the same length (same number of values).
The vectors can then be bound together using the function data.frame():
# combine the columns into a data frame, by referencing the vector names
manual_entry_cols <- data.frame(PatientID, Treatment, Death)
And now we display the new dataset:
Use the tribble() function from the tibble package, which is part of the tidyverse (online tibble reference).
Note how column headers start with a tilde (~). Also note that each column must contain only one class of data (character, numeric, etc.).
You can use tabs, spacing, and new rows to make the data entry more intuitive and readable. For example:
# create the dataset manually by row
manual_entry_rows <- tibble::tribble(
~colA, ~colB,
"a", 1,
"b", 2,
"c", 3
)
And now we display the new dataset:
If you copy data from elsewhere and have it on your clipboard, you can try the following command to convert those data into an R data frame:
The following packages are recommended for working with dates:
# Checks if package is installed, installs if necessary, and loads package for current session
pacman::p_load(aweek, # flexibly converts dates to weeks, and vice versa
lubridate, # for conversions to months, years, etc.
linelist, # function to guess messy dates
ISOweek) # another option for creating weeks
as.Date()
The standard, base R function to convert an object or variable to class Date is as.Date() (note capitalization).
as.Date() requires that the user specify the existing format of the date, so it can understand, convert, and store each element (day, month, year, etc.) correctly. Read more online about as.Date().
If used on a variable, as.Date() therefore requires that all the character date values be in the same format before converting. If your data are messy, try cleaning them or consider using guess_dates() from the linelist package.
It can be easiest to first convert the variable to character class, and then convert to date class:
as.character()
as.Date()
Within the as.Date() function, you must use the format= argument to tell R the current format of the date components: which characters refer to the month, the day, and the year, and how they are separated. If your values are already in one of R’s standard date formats (YYYY-MM-DD or YYYY/MM/DD) the format= argument is not necessary.
For example, if your character dates are in the format DD/MM/YYYY, like “24/04/1968”, then your command to turn the values into dates will be as below. Putting the format in quotation marks is necessary.
TIP: The format= argument is not telling R the format you want the dates to be, but rather how to identify the date parts as they are before you run the command.
TIP: Be sure that in the format= argument you use the date-part separator (e.g. /, -, or space) that is present in your dates.
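As a minimal sketch of the conversion described above, using only base R: the character values below are invented for illustration, and the object names onset_chr and onset_date are hypothetical.

```r
# Hypothetical character dates written as DD/MM/YYYY (values invented for illustration)
onset_chr <- c("24/04/1968", "03/11/2014", "15/01/2015")

# format = describes how the dates are written NOW, including the "/" separator
onset_date <- as.Date(onset_chr, format = "%d/%m/%Y")

# confirm the class, then print (R displays Dates as YYYY-MM-DD)
class(onset_date)
onset_date

# A format that does not match the data returns NA rather than an error
as.Date("24/04/1968", format = "%d-%m-%Y")
```

The last command shows why it is worth checking the result: a wrong separator in format= silently produces missing values.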
The as.character() and as.Date() commands can optionally be combined as:
linelist_cleaned$date_of_onset <- as.Date(as.character(linelist_cleaned$date_of_onset), format = "%d/%m/%Y")
If using piping and the tidyverse, the above command might look like this:
linelist_cleaned <- linelist_cleaned %>%
mutate(date_of_onset = as.character(date_of_onset),
date_of_onset = as.Date(date_of_onset, format = "%d/%m/%Y"))
Once complete, you can run a command to verify the class of the variable.
Once the values are in class Date, R will by default display them in the standard format, which is YYYY-MM-DD.
This is a section on using lubridate (Henry)
guess_dates()
The function guess_dates() attempts to read a “messy” date variable containing dates in many different formats and convert the dates to a standard format. You can read more online about guess_dates(), which is in the linelist package.
For example, guess_dates() would read the dates “03 Jan 2018”, “07/03/1982”, and “08/20/85” and convert them to class Date as 2018-01-03, 1982-03-07, and 1985-08-20.
linelist::guess_dates(c("03 Jan 2018", "07/03/1982", "08/20/85")) # guess_dates() not yet available on CRAN for R 4.0.2
# try install via devtools::install_github("reconhub/linelist")
Some optional arguments for guess_dates() that you might include are:
error_tolerance - the proportion of entries which cannot be identified as dates to be tolerated (defaults to 0.1, or 10%)
last_date - the last valid date (defaults to the current date)
first_date - the first valid date (defaults to fifty years before last_date)
Excel stores dates as the number of days since December 30, 1899. If the dataset you imported from Excel shows dates as numbers or characters like “41369”, use the as.Date() function to convert them (converting characters to numeric first), but instead of supplying a format as above, supply an origin date.
NOTE: You should provide the origin date in R’s default date format ("YYYY-MM-DD").
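A short sketch of the Excel conversion, assuming the serial numbers arrived as character values; the vector name excel_chr and the second value are invented for illustration.

```r
# Hypothetical Excel serial dates as they might look after import (stored as text)
excel_chr <- c("41369", "41370")

# Convert to numeric first, then count days forward from Excel's origin date
excel_date <- as.Date(as.numeric(excel_chr), origin = "1899-12-30")

excel_date
# [1] "2013-04-05" "2013-04-06"
```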
Once dates are the correct class, you often want them to display differently (e.g. in a plot, graph, or table), for example as “Friday 05 Jan” instead of 2018-01-05. You can do this with the function format(), which works in a similar way to as.Date(). Read more in this online tutorial.
%d = day number of the month (e.g. 16, 17, 18…)
%a = abbreviated weekday (Mon, Tue, Wed, etc.)
%A = full weekday (Monday, Tuesday, etc.)
%m = month number (e.g. 01, 02, 03, 04)
%b = abbreviated month (Jan, Feb, etc.)
%B = full month (January, February, etc.)
%y = 2-digit year (e.g. 89)
%Y = 4-digit year (e.g. 1989)
%H = hours (24-hr clock)
%M = minutes
%S = seconds
%z = offset from GMT
%Z = time zone (character)
An example of formatting today’s date:
# today's date, with formatting
format(Sys.Date(), format="%d %B %Y")
## [1] "07 December 2020"
# easy way to get full date and time (no formatting)
date()
## [1] "Mon Dec 07 16:00:07 2020"
# formatted date, time, and time zone (using paste0() function)
paste0(format(Sys.Date(), format= "%A, %b %d '%y, %z %Z, "), format(Sys.time(), format = "%H:%M:%S"))
## [1] "Monday, Dec 07 '20, +0000 UTC, 16:00:07"
The difference between dates can be calculated by:
TODO
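As a minimal sketch of one common base R approach (the two dates are invented for illustration), Date objects can be subtracted directly, or passed to difftime() when you want to choose the units:

```r
# Two hypothetical dates for one case
onset   <- as.Date("2014-04-15")
outcome <- as.Date("2014-05-06")

# Subtracting two Date objects returns a "difftime" object
outcome - onset
# Time difference of 21 days

# difftime() lets you set the units; as.numeric() drops the units label
days_to_outcome  <- as.numeric(difftime(outcome, onset, units = "days"))
weeks_to_outcome <- as.numeric(difftime(outcome, onset, units = "weeks"))

days_to_outcome
# [1] 21
weeks_to_outcome
# [1] 3
```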
The templates use the very flexible package aweek to set epidemiological weeks. You can read more about it on the RECON website
See the section on epicurves.
Sys.Date() returns the current date of your computer
Sys.time() returns the current date and time of your computer (class POSIXct)
date() returns the current date and time as a character string
This page demonstrates common steps necessary to clean a dataset. It uses a simulated Ebola case linelist, which is used throughout the handbook.
HOW TO READ: To emphasize the tidyverse coding approach, each cleaning step is explained individually and then incorporated into a “cleaning pipeline” - a series of cleaning actions linked together sequentially through pipes (LINK TO PIPES). The pipe begins with the “raw” data (linelist_raw) and ends with a “clean” dataset (linelist).
The cleaning steps demonstrated include:
case_when())
replace missing with
dealing with cases (all lower, etc)
case_when()
factors
Import the raw dataset using the import() function from the package rio. (LINK HERE TO IMPORT PAGE)
You can view the original raw dataset below:
As explained in the section on dplyr and tidyverse coding style (LINK HERE), a chain of ‘verb’ functions operates on a dataset through ‘pipes’ (%>%), passing the output from one verb to the next.
order is important
TO DO
Column names are used so often that it is best for them to have “clean” syntax. We suggest the following:
The names of linelist_raw are below. We can see that there are some with spaces. We also have different naming patterns for dates (‘date onset’ and ‘infection date’).
names(linelist_raw)
## [1] "row_num" "case_id" "generation" "infection date"
## [5] "date onset" "hosp date" "date_of_outcome" "outcome"
## [9] "gender" "hospital" "lon" "lat"
## [13] "infector" "source" "age" "age_unit"
## [17] "fever" "chills" "cough" "aches"
## [21] "vomit"
Note: To use a column name that includes spaces, surround the name with back-ticks, for example: linelist$`infection date`
On a keyboard, the back-tick (`) is different from the single quotation mark ('), and is sometimes on the same key as the tilde (~).
The function clean_names() from the package janitor standardizes column names by transliterating to unique ASCII names by doing the following:
case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…)
replace = argument (e.g. replace = c(onset = "date_of_onset"))
# send the dataset through the function clean_names()
linelist <- linelist_raw %>%
janitor::clean_names()
# see the new names
names(linelist)
## [1] "row_num" "case_id" "generation" "infection_date"
## [5] "date_onset" "hosp_date" "date_of_outcome" "outcome"
## [9] "gender" "hospital" "lon" "lat"
## [13] "infector" "source" "age" "age_unit"
## [17] "fever" "chills" "cough" "aches"
## [21] "vomit"
Re-naming columns manually is often necessary. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style “NEW = OLD”: the new column name is given before the old column name.
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome)
Now you can see that the column names have been changed:
CAUTION: This tab may follow from previous tabs.
Often the first step of cleaning data is selecting the columns you want to work with, and to set their order in the dataframe. In a dplyr chain of verbs, this is done with select(). Note that in these examples we modify linelist with select(), but do not assign/overwrite. We just display the resulting new column names, for purpose of example.
Here are all the column names in the linelist:
names(linelist)
## [1] "row_num" "case_id" "generation"
## [4] "date_infection" "date_onset" "date_hospitalisation"
## [7] "date_outcome" "outcome" "gender"
## [10] "hospital" "lon" "lat"
## [13] "infector" "source" "age"
## [16] "age_unit" "fever" "chills"
## [19] "cough" "aches" "vomit"
With select() you can do the following:
Select only the columns you want to remain, and their order of appearance
# linelist dataset is piped through select() command, and names() prints just the column names
linelist %>%
select(case_id, date_onset, date_hospitalisation, fever) %>%
names() # display the column names
## [1] "case_id" "date_onset" "date_hospitalisation"
## [4] "fever"
Indicate which columns to remove by placing a minus symbol “-” in front of the column name (e.g. select(-outcome)), or a vector of column names (as below). All other columns will be retained.
Inside select() you can use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.
linelist %>%
select(-c(date_onset, fever:vomit)) %>% # remove onset and all symptom columns
names()
## [1] "row_num" "case_id" "generation"
## [4] "date_infection" "date_hospitalisation" "date_outcome"
## [7] "outcome" "gender" "hospital"
## [10] "lon" "lat" "infector"
## [13] "source" "age" "age_unit"
Re-order the columns - use everything() to signify all other columns not specified in the select() command:
# move case_id, date_onset, date_hospitalisation, and gender to beginning
linelist %>%
select(case_id, date_onset, date_hospitalisation, gender, everything()) %>%
names()
## [1] "case_id" "date_onset" "date_hospitalisation"
## [4] "gender" "row_num" "generation"
## [7] "date_infection" "date_outcome" "outcome"
## [10] "hospital" "lon" "lat"
## [13] "infector" "source" "age"
## [16] "age_unit" "fever" "chills"
## [19] "cough" "aches" "vomit"
As well as everything(), there are several special functions that work within select(), namely:
everything() - all other columns not mentioned
last_col() - the last column
starts_with() - matches a specified prefix. Example: select(starts_with("date"))
ends_with() - matches a specified suffix. Example: select(ends_with("_end"))
contains() - columns containing a character string. Example: select(contains("time"))
matches() - applies a regular expression (regex). Example: select(matches("[pt]al"))
num_range() - a numerical range of column names, such as x01, x02, x03
any_of() - matches if the column is named. Useful if a name might not exist. Example: select(any_of(c("date_onset", "date_death", "cardiac_arrest")))
where() - applies a function to all columns and selects those for which it returns TRUE
In the linelist, there is one column we do not need: row_num. Remove it by adding a select() command to the cleaning pipe chain:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-row_num)
CAUTION: This tab may follow from previous tabs.
See section on object classes
Here we want to ensure that the class of each column is appropriate. First we run some checks on the classes of important columns.
The class of the “age” column is character. To perform analysis, we need those numbers to be recognized as numeric!
The class of the “date_onset” column is also character! To perform analysis, these dates must be recognized as dates!
Use table() or another method to see all the values. Here we can see that one date was entered in a different format (15 April 2014) than all the others!
##
## 15 April 2014 2010-05-07 2010-05-08 2010-05-27 2010-06-15
## 1 1 1 1 1
## 2010-06-18
## 1
This means before we can classify “date_onset” as a date, this value must be fixed to be the same format as the others. You can fix the date in the source data. Or, we can do this using mutate() and recode() in our cleaning pipe chain, before the commands to convert to class Date. LINK TO CLASSIFYING column AS DATE.
The new mutate line can be read as: mutate date_onset to equal date_onset recoded so that OLD VALUE is changed to NEW VALUE. Note that this pattern (OLD = NEW) is the opposite of most R patterns. The R development community is working on revising this.
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-row_num) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
###################################################
# fix incorrect values # old value # new value
mutate(date_onset = recode(date_onset, "15 April 2014" = "2014-04-15")) %>%
# correct the class of the columns
mutate(age = as.numeric(age),
date_onset = as.Date(date_onset, format = "%Y-%m-%d"))
Especially after converting to class Date, check your data visually or with table() to confirm that they were converted correctly! For as.Date(), the format = argument is often a source of errors.
class(linelist$date_infection)
## [1] "POSIXct" "POSIXt"
head(linelist$date_infection)
## [1] "2014-04-09 UTC" NA NA "2014-05-07 UTC"
## [5] NA "2014-05-06 UTC"
You can use the dplyr function across() with mutate() to convert several columns at once to a new class. across() allows you to specify which columns you want a function to apply to. Below, we want to mutate the columns where is.POSIXct() is TRUE (POSIXct is a date-time class that shows unnecessary timestamps), and apply the function as.Date() to them, in order to convert them to class Date.
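A minimal sketch of this pattern, assuming the dplyr and lubridate packages are installed. The small data frame demo and its values are invented for illustration and stand in for the linelist.

```r
library(dplyr)

# Small invented data frame with two date-time (POSIXct) columns
demo <- tibble(
  case_id        = c("a1", "b2"),
  date_infection = as.POSIXct(c("2014-04-09", "2014-05-07"), tz = "UTC"),
  date_outcome   = as.POSIXct(c("2014-04-20", NA), tz = "UTC")
)

# For every column where is.POSIXct() is TRUE, apply as.Date()
# (both functions are written without parentheses inside across())
demo_dates <- demo %>%
  mutate(across(where(lubridate::is.POSIXct), as.Date))

# check the class of each column after conversion
sapply(demo_dates, class)
```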
Within across() we also use the function where().
is.POSIXct() is from the package lubridate; similar functions (is.character(), is.numeric(), and is.logical()) are from base R.
Functions provided to across() are written without the empty parentheses ().
CAUTION: This tab may follow from previous tabs.
After selecting columns, a typical cleaning step is to filter the dataframe for specific rows using the dplyr verb filter(). Within filter(), give the logic that must be TRUE for a row in the dataset to be kept. A simple example below filters to keep only the rows where case_id is not missing (three rows are removed):
A more complex example:
Run a simple histogram of onset dates to see that a second smaller outbreak from 2012-2013 is also included in this dataset. For our analyses, we want to remove entries from this earlier outbreak.
If we simply filter linelist by date of onset (after June 2013) we may make a mistake! Applying filter(date_onset > as.Date("2013-06-01")) would accidentally remove any rows in the later epidemic with a missing date of onset!
DANGER: Filtering to greater than (>) or less than (<) a date can remove any rows with missing date values (NA)! This is because comparing NA to any value returns NA, and filter() keeps only the rows where the condition is TRUE.
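The behaviour can be seen in a small sketch, assuming dplyr is installed; the three-row data frame demo is invented for illustration.

```r
library(dplyr)

# Invented example: one early case, one later case, one case with a missing onset date
demo <- tibble(
  case_id    = c("a1", "b2", "c3"),
  date_onset = as.Date(c("2012-11-02", "2014-05-01", NA))
)

# The comparison itself returns NA for the missing date
demo$date_onset > as.Date("2013-06-01")
# [1] FALSE  TRUE    NA

# filter() keeps only rows where the condition is TRUE, so case c3 is silently dropped
late_only <- demo %>%
  filter(date_onset > as.Date("2013-06-01"))
nrow(late_only)
# [1] 1

# Keep rows with a missing date explicitly by adding is.na()
late_or_missing <- demo %>%
  filter(date_onset > as.Date("2013-06-01") | is.na(date_onset))
nrow(late_or_missing)
# [1] 2
```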
We also know that this first epidemic occurred at Hospital A, Hospital B, and there were 10 cases at Connaught Hospital. Hospitals A & B did not have cases in the second epidemic, but Connaught Hospital had many. This is a complex filter to apply - it is wise to tabulate these columns to know exactly how many rows we expect should be removed.
Let’s examine a cross-tabulation to make sure we exclude only the correct rows:
table(Hospital = linelist$hospital, # hospital name
YearOnset = lubridate::year(linelist$date_onset), # year of the date_onset
useNA = "always") # show missing values
## YearOnset
## Hospital 2010 2011 2014 2015 <NA>
## Connaught Hopital 0 0 38 6 3
## Connaught Hospital 8 2 1322 327 66
## Hospital A 40 13 0 0 1
## Hospital B 41 12 0 0 1
## Military Hopital 0 0 20 8 2
## Military Hospital 0 0 580 168 38
## Mitylira Hopital 0 0 1 0 0
## Mitylira Hospital 0 0 59 17 3
## other 0 0 679 168 38
## Princess Christian Maternity Hopital (PCMH) 0 0 10 1 0
## Princess Christian Maternity Hospital (PCMH) 0 0 306 90 15
## Rokupa Hopital 0 0 10 1 0
## Rokupa Hospital 0 0 332 94 17
## <NA> 0 0 1106 296 67
We want to exclude only the nrow(linelist %>% filter(hospital %in% c("Hospital A", "Hospital B") | date_onset < as.Date("2013-06-01"))) rows from 2012 and 2013 at those three hospitals (A, B, and Connaught), including the 2 from Hospitals A & B with missing onset dates, but not any others with missing onset dates. We start with a linelist of nrow(linelist) rows. Here is our statement:
linelist <- linelist %>%
filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))
nrow(linelist)
## [1] 5888
When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, the 10 Connaught Hospital cases from 2012 & 2013 are removed, and all other values are the same - just as we wanted.
table(Hospital = linelist$hospital, # hospital name
YearOnset = lubridate::year(linelist$date_onset), # year of the date_onset
useNA = "always") # show missing values
## YearOnset
## Hospital 2014 2015 <NA>
## Connaught Hopital 38 6 3
## Connaught Hospital 1322 327 66
## Military Hopital 20 8 2
## Military Hospital 580 168 38
## Mitylira Hopital 1 0 0
## Mitylira Hospital 59 17 3
## other 679 168 38
## Princess Christian Maternity Hopital (PCMH) 10 1 0
## Princess Christian Maternity Hospital (PCMH) 306 90 15
## Rokupa Hopital 10 1 0
## Rokupa Hospital 332 94 17
## <NA> 1106 296 67
Multiple filter statements can be separated by commas, or you can always pipe to a separate filter() statement for clarity. Adding these filter statements to the cleaning pipe chain now looks like this:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-row_num) %>%
# fix incorrect values # old value # new value
mutate(date_onset = recode(date_onset, "15 April 2014" = "2014-04-15")) %>%
# correct the class of the columns
mutate(age = as.numeric(age),
date_onset = as.Date(date_onset, format = "%Y-%m-%d")) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
###################################################
filter(!is.na(case_id), # keep only rows where case_id is not missing
date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B"))
) # close filter
filter(dataset, criteria) OR subset like: dataset_new <- dataset[criteria,criteria]
nrow(linelist %>% filter())
CAUTION: This tab may follow from previous tabs.
We advise creating new columns with dplyr functions as part of a chain of such verb functions (e.g. filter, mutate, etc.)
If in need of a stand-alone command, you can use plyr or also the base R style to create a new column.
mutate() (dplyr)
The verb mutate() is used to add a new column or modify an existing one. Below are some examples of creating new columns with mutate(). The syntax is: new_column_name = value or function. It is best practice to separate each new column with a comma and new line.
linelist <- linelist %>% # creating new, or modifying old dataset
mutate(new_var_dup = case_id, # new column = duplicate/copy another column
new_var_static = 7, # new column = all values the same
new_var_static = new_var_static + 5, # you can overwrite a column, and can modify a column multiple times
new_var_calc = (age / 12), # new column = a calculation
new_var_paste = paste0(hospital, " (", date_hospitalisation, ")") # new column = pasting together values from other columns
)
Scroll to the right to see the new columns:
# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter = "top", options = list(pageLength = 5, scrollX = TRUE))
TIP: The verb transmute() adds new columns just like mutate() but also drops/removes all other columns that you do not mention.
To recode the values in a column, mutate() is also used.
For example, in the linelist we need to clean the values in the column “hospital”. There are several incorrect spellings, and many missing values.
table(linelist$hospital, useNA = "always")
##
## Connaught Hopital
## 47
## Connaught Hospital
## 1715
## Military Hopital
## 30
## Military Hospital
## 786
## Mitylira Hopital
## 1
## Mitylira Hospital
## 79
## other
## 885
## Princess Christian Maternity Hopital (PCMH)
## 11
## Princess Christian Maternity Hospital (PCMH)
## 411
## Rokupa Hopital
## 11
## Rokupa Hospital
## 443
## <NA>
## 1469
To change spellings one-by-one, you can use the recode() function within the mutate() function. The code is saying that the column “hospital” should be defined as the current column “hospital” but with certain changes (syntax is OLD = NEW). Don’t forget commas!
linelist <- linelist %>%
mutate(hospital = recode(hospital,
# OLD = NEW
"Mitylira Hopital" = "Military Hospital",
"Mitylira Hospital" = "Military Hospital",
"Military Hopital" = "Military Hospital",
"Connaught Hopital" = "Connaught Hospital",
"Rokupa Hopital" = "Rokupa Hospital",
"other" = "Other",
"Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
))
table(linelist$hospital, useNA = "always")
##
## Connaught Hospital
## 1762
## Military Hospital
## 896
## Other
## 885
## Princess Christian Maternity Hospital (PCMH)
## 422
## Rokupa Hospital
## 454
## <NA>
## 1469
TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.
TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space in between ("").
To change missing values to a character value, such as “Missing”, use the function replace_na() in the same manner as recode above:
Likewise you can quickly convert character values to NA using na_if(), as below:
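A minimal sketch of both functions, assuming the dplyr and tidyr packages are installed. The data frame demo and the new column names hospital_nomiss and hospital_na are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Invented example with one true missing value (NA) and one blank value ("")
demo <- tibble(hospital = c("Rokupa Hospital", NA, "", "Other"))

demo_clean <- demo %>%
  mutate(
    hospital_nomiss = replace_na(hospital, "Missing"),  # NA becomes "Missing"
    hospital_na     = na_if(hospital, "")               # "" becomes NA
  )

demo_clean
```

Note that replace_na() leaves the blank value "" untouched, because R does not treat it as missing.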
Intro
For simple cases you can use ifelse() or if_else(). In most other cases it is better to use case_when().
ifelse() and if_else():
These commands are simplified versions of an if and else statement. The general syntax is ifelse(condition, value if TRUE, value if FALSE). if_else() is a stricter version from dplyr: it requires the TRUE and FALSE values to be of the same type, and it preserves classes such as Date.
Stringing together ifelse statements - NOT ADVISED!! Difficult to read and keep track of.
IMAGE of ifelse string with X across is.
Use case_when() instead.
You can reference other columns with the ifelse() function within mutate():
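A small sketch, assuming dplyr is installed. The values of age and age_unit, and the new columns age_years and adult, are assumptions made for illustration.

```r
library(dplyr)

# Invented example: age recorded in years or months, with one missing age
demo <- tibble(
  age      = c(36, 5, 18, NA),
  age_unit = c("years", "months", "years", "years")
)

demo <- demo %>%
  mutate(
    # ifelse(): condition, value if TRUE, value if FALSE (references two columns)
    age_years = ifelse(age_unit == "months", age / 12, age),
    # if_else(): stricter dplyr version, with an explicit value for missing
    adult     = if_else(age_years >= 18, "Adult", "Child", missing = "Unknown")
  )

demo
```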
Missing if… na_if() lead(), lag() cumsum(), cummean(), cummin(), cummax(), cumany(), cumall(),
coalesce()
if_else(), ifelse()
recode CLEANING MISSPELLINGS HOSPITAL NAME
Replace
CAUTION: This tab may follow from previous tabs.
## load cleaning rules and only keep columns in mll
mll_cleaning_rules <- import(here("dictionaries/mll_cleaning_rules.xlsx")) %>%
filter(column %in% c(names(mll_raw), ".global"))
## define columns that are not cleaned
unchanged <- c(
"epilink_relationship",
"narratives",
"epilink_relationship_detail"
)
mll_clean <- mll_raw %>%
## convert to tibble
as_tibble() %>%
## clean columns using cleaning rules
clean_data(
wordlists = mll_cleaning_rules,
protect = names(.) %in% unchanged
)
CAUTION: This tab may follow from previous tabs.
Using mutate on GROUPED dataframes https://dplyr.tidyverse.org/reference/mutate.html
Taken from website above:
Because mutating expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped mutate:
starwars %>%
select(name, mass, species) %>%
mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
With the grouped equivalent:
starwars %>%
select(name, mass, species) %>%
group_by(species) %>%
mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
The former normalises mass by the global average whereas the latter normalises by the averages within species levels.
If you need to write a stand-alone command using base R (e.g. not part of a chain of dplyr verbs), then you can create a new column by assigning it a value. In the command below, the column new_var does not exist until after the command is executed. In this simple example the column is assigned the static value “new value”, so for all rows the value will be “new value”.
You can also give the new column a dynamic value as shown below, or using the case_when() command explained in the next tab.
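Both cases can be sketched in base R as follows; the data frame demo and the column age_months are invented for illustration.

```r
# Invented example data frame
demo <- data.frame(case_id = c("a1", "b2", "c3"),
                   age     = c(24, 60, 5))

# static value: every row receives the same value
demo$new_var <- "new value"

# dynamic value: calculated row-by-row from another column
demo$age_months <- demo$age * 12

demo
```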
case_when()
If you need to use logic statements to recode values, or want to use operators like %in%, use dplyr’s case_when() instead. If you use case_when() please read the thorough explanation HERE LINK, as there are important differences from recode() in syntax and logic order!
linelist <- linelist %>%
mutate(hospital = case_when(hospital == "Connaught Hopital" ~ "Connaught Hospital",
hospital == "Rokupa Hopital" ~ "Rokupa Hospital",
hospital %in% c("Mitylira Hopital",
"Mitylira Hospital",
"Military Hopital") ~ "Military Hospital",
is.na(hospital) ~ "Missing",
hospital == "Princess Christian Maternity Hopital (PCMH)" ~ "Princess Christian Maternity Hospital (PCMH)",
TRUE ~ hospital)
)
table(linelist$hospital, useNA = "always")
##
## Connaught Hospital
## 1762
## Military Hospital
## 896
## Missing
## 1469
## Other
## 885
## Princess Christian Maternity Hospital (PCMH)
## 422
## Rokupa Hospital
## 454
## <NA>
## 0
CAUTION: This tab may follow from previous tabs.
TODO tutorial on using case_when()
For example, creating age groups cut()
case_when()
age_categories() (R4Epis package)
by percentile
WHAT TO DO IF AGE IS SPREAD ACROSS TWO VARIABLES (e.g. numeric age + unit)
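As a rough sketch of the cut() option mentioned above, using base R only: the age values, break points, and labels are assumptions chosen for illustration, not recommended categories.

```r
# Invented ages in years, including a missing value
age <- c(0, 4, 5, 17, 18, 49, 50, 83, NA)

age_group <- cut(
  age,
  breaks = c(0, 5, 18, 50, Inf),               # lower bound of each group, plus an upper limit
  labels = c("0-4", "5-17", "18-49", "50+"),   # one label per interval
  right  = FALSE                               # intervals are [lower, upper), so 5 falls in "5-17"
)

# tabulate the groups, showing missing values too
table(age_group, useNA = "always")
```

Setting right = FALSE matters here: with the default (right = TRUE) an age of exactly 5 would be placed in the first group and an age of 0 would become NA.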
CAUTION: This tab may follow from previous tabs.
Within a group, indicate/convert to the highest value in the group
Santa Clara County example - COVID contact tracing data - classification of multiple phone call records from same person into the highest category. (classify all as the highest of the group)
across dplyr
dealing with missing data percent missing over time etc.
Or change in percent of anything (X) over time, really.
lines <- linelist %>%
mutate(date_of_onset = as.Date(date_of_onset, format = "%d/%m/%Y"),
week = aweek::week2date(aweek::date2week(date_of_onset))) %>%
group_by(week) %>%
summarize(n_obs = n(),
dt_hosp_missing = sum(date_of_hospitalisation == "" | is.na(date_of_hospitalisation)),
dt_hosp_p_miss = dt_hosp_missing / n_obs,
outcome_missing = sum(outcome == "" | is.na(outcome)),
outcome_p_miss = outcome_missing / n_obs) %>%
reshape2::melt(id.vars = c("week")) %>%
filter(grepl("_p_", variable)) %>%
ggplot()+
geom_line(aes(x = week, y = value, group = variable, color = variable), size = 1, stat = "identity")+
labs(title = "Missingness in variables, as proportion of ",
#subtitle = str_glue("As of {format(report_date, '%d %b')}"),
x = "Week",
y = "Proportion missing",
fill = "CalREDIE Variable") +
scale_color_discrete(name = "Variable", labels = c("Date of Hospitalization Missing", "Outcome Missing"))+
scale_y_continuous(breaks = c(seq(0,1,0.1)))
#theme_cowplot()#+
#theme(legend.position = element_text("none"))
lines
(pivoting/melting etc.) Transforming datasets from wide-to-long, or long-to-wide…
Transforming a dataset from wide to long
We start with data that is in a wide format, e.g. our linelist.
pivot_longer() and pivot_wider() (both from the tidyr package)
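A minimal sketch of the two functions, assuming the tidyr package is installed. The small data frame wide and the new column names symptom and present are invented for illustration.

```r
library(tidyr)

# Invented wide data: one row per case, one column per symptom
wide <- data.frame(
  case_id = c("a1", "b2"),
  fever   = c("yes", "no"),
  cough   = c("no", "yes")
)

# wide-to-long: one row per case per symptom
long <- pivot_longer(
  wide,
  cols      = c(fever, cough),   # columns to stack
  names_to  = "symptom",         # new column holding the old column names
  values_to = "present"          # new column holding the values
)

# long-to-wide: back to one row per case
wide_again <- pivot_wider(long, names_from = symptom, values_from = present)
```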
The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}
Keep the title of this section as “Overview”.
This tab should include:
Tidyverse - grouping by values
.drop=F in group_by() command
Keep the title of this section as “Preparation”.
Data preparation steps such as:
group_by()
aggregate()
This tab should stay with the name “Resources”. Links to other online tutorials or resources.
The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}
Keep the title of this section as “Overview”.
This tab should include:
Keep the title of this section as “Preparation”.
Data preparation steps such as:
Can be used to separate major steps of data preparation. Re-name as needed
Can be used to separate major steps of data preparation. Re-name as needed.
This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.
Sub-tabs if necessary. Re-name as needed.
This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.
Sub-tabs if necessary. Re-name as needed.
This tab should stay with the name “Resources”. Links to other online tutorials or resources.
antijoins as well
rowmatcher other options (finlay?)
This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.
Sub-tabs if necessary. Re-name as needed.
This tab should stay with the name “Resources”. Links to other online tutorials or resources.
The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}
Keep the title of this section as “Overview”.
This tab should include:
Keep the title of this section as “Preparation”.
Data preparation steps such as:
Can be used to separate major steps of data preparation. Re-name as needed
Can be used to separate major steps of data preparation. Re-name as needed.
This tab can be renamed. It should demonstrate the task using the recommended package or approach, for example a package customized for the task, where execution is simple and fast but perhaps less customizable (e.g. using the incidence package to create an epicurve).
Sub-tabs if necessary. Re-name as needed.
This tab can be renamed. It should demonstrate the task using a more standard or core package (e.g. ggplot2 or base R) that allows more flexibility in the output or offers more package stability, for example creating an epicurve with ggplot2.
Sub-tabs if necessary. Re-name as needed.
This tab should stay with the name “Resources”. Links to other online tutorials or resources.
The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}
Keep the title of this section as “Overview”.
This tab should include:
Many audiences and reasons for the tables…
Keep the title of this section as “Preparation”.
Data preparation steps such as:
knitr::kable DT
For publication
quickly changing the denominator (per 100,000, etc.)
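One way to make the denominator easy to change is to wrap the rate calculation in a small function with the denominator as an argument. This is only a sketch: the function rate_per() and the counts and populations below are hypothetical and chosen for illustration.

```r
# Hypothetical helper: converts counts to a rate per chosen denominator
rate_per <- function(cases, population, per = 100000) {
  cases / population * per
}

# Hypothetical counts and populations (not real data)
rates <- data.frame(area       = c("Area A", "Area B"),
                    cases      = c(46, 33),
                    population = c(500000, 250000))

# Same counts expressed per 100,000 and per 1,000 population
rates$per_100k <- rate_per(rates$cases, rates$population)
rates$per_1000 <- rate_per(rates$cases, rates$population, per = 1000)
rates
```

Because the denominator is a single argument, a table can be re-expressed per 1,000 or per 10,000 without touching the rest of the code.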
This tab should stay with the name “Resources”. Links to other online tutorials or resources.
The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}
Keep the title of this section as “Overview”.
This tab should include:
Keep the title of this section as “Preparation”.
Data preparation steps such as:
Note the argument na.rm = TRUE, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).
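The effect of na.rm can be seen with a short vector. This is a minimal sketch with made-up values; the object ages is hypothetical and not a column of the handbook data.

```r
# Hypothetical ages, one of them missing
ages <- c(10, 25, NA, 40)

mean(ages)                # NA, because one value is missing
mean(ages, na.rm = TRUE)  # 25, the missing value is dropped first
sum(is.na(ages))          # 1, the number of missing values
```

Reporting the number of missing values alongside the summary statistic makes clear how many observations the statistic is based on.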
Frequency table of 1 and 2 categorical variables
table(linelist$province)
##
## Anhui Beijing Fujian Guangdong Hebei Henan Hunan Jiangsu
## 4 3 5 1 1 4 2 28
## Jiangxi Shandong Shanghai Taiwan Zhejiang
## 6 2 33 1 46
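Counts like these are often reported together with proportions and a total. The sketch below shows one way to do this in base R, using a short hypothetical vector rather than the linelist so that it runs on its own.

```r
# Hypothetical province values, including one missing value
province <- c("Shanghai", "Zhejiang", "Zhejiang", "Jiangsu", "Zhejiang", NA)

# Counts, keeping missing values as their own category
counts <- table(province, useNA = "ifany")

# Proportions of the total, shown as percentages
props <- prop.table(counts)
round(100 * props, 1)

# Counts with a total appended
addmargins(counts)
```

Setting useNA = "ifany" keeps missing values visible in the table instead of silently dropping them from the denominator.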
x <- table(linelist$province, linelist$gender)
#janitor::adorn_totals(x)
A table with 3 variables
table_3vars <- table(linelist$province, linelist$gender, linelist$outcome)
ftable(table_3vars)
## Death Recover
##
## Anhui f 1 0
## m 1 1
## Beijing f 0 1
## m 0 1
## Fujian f 0 0
## m 0 3
## Guangdong f 0 0
## m 0 0
## Hebei f 1 0
## m 0 0
## Henan f 0 0
## m 1 3
## Hunan f 1 0
## m 0 1
## Jiangsu f 2 3
## m 2 7
## Jiangxi f 0 2
## m 1 1
## Shandong f 0 0
## m 0 2
## Shanghai f 3 3
## m 12 10
## Taiwan f 0 0
## m 0 0
## Zhejiang f 1 3
## m 5 5
This tab should stay with the name “Resources”. Links to other online tutorials or resources.
The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}
Tests by variable type: quantitative/quantitative, quantitative/categorical, categorical/categorical. t-tests, odds ratios, Mantel-Haenszel, etc.
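As an illustration of the categorical/categorical case, the sketch below computes an odds ratio with a Wald 95% confidence interval from a 2x2 table in base R. The counts are hypothetical. Base R also provides fisher.test() and mantelhaen.test() in the stats package for exact and stratified analyses.

```r
# Hypothetical 2x2 table: rows = exposure, columns = outcome
tab <- matrix(c(20, 10,
                15, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("exposed", "unexposed"),
                              outcome  = c("case", "non_case")))

# Odds ratio: cross-product of the table cells
odds_ratio <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Wald 95% confidence interval, calculated on the log scale
se_log_or <- sqrt(sum(1 / tab))
ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

odds_ratio
ci
```

The Wald interval is an approximation that behaves poorly with small cell counts, in which case an exact method may be preferable.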
Keep the title of this section as “Preparation”.
Data preparation steps such as:
This tab can be renamed. It should demonstrate the task using the recommended package or approach, for example a package customized for the task, where execution is simple and fast but perhaps less customizable (e.g. using the incidence package to create an epicurve).
Sub-tabs if necessary. Re-name as needed.
This tab can be renamed. It should demonstrate the task using a more standard or core package (e.g. ggplot2 or base R) that allows more flexibility in the output or offers more package stability, for example creating an epicurve with ggplot2.
Sub-tabs if necessary. Re-name as needed.
This tab should stay with the name “Resources”. Links to other online tutorials or resources.
The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}
Keep the title of this section as “Preparation”.
Data preparation steps such as:
This code chunk shows the loading of packages required for the analyses.
# Create vector of names of required packages:
packages_epicurve <- c("rio", # File import
"here", # File locator
"tidyverse", # data manipulation
"ggplot2", # Produce plots and graphs
"aweek", # working with dates
"lubridate", # Manipulate dates
"incidence", # an option for epicurves of linelist data
"stringr", # Search and manipulate character strings
"forcats", # working with factors
"RColorBrewer", # Color palettes from colorbrewer2.org
"DT" # produce tables for this html handbook
) ### close vector of packages
# Checks if package is installed, installs if necessary, and loads package for current session
pacman::p_load(packages_epicurve, character.only=TRUE)
Two example datasets are used in this section:
If viewing in Google Chrome, you can access these datasets in Microsoft Excel by clicking HERE and HERE. TODO.
The dataset is imported using the import() function from the rio package. See the page on importing data for various ways to import data. The linelist and aggregated versions of the data are displayed below.
For most of this document, the linelist dataset will be used. The aggregated counts dataset will be used at the end.
Do cleaning steps as necessary!!! To Do
Review the two datasets and notice the differences
Linelist dataset
# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )
Aggregated counts dataset
You may want to set certain parameters for production of a report, such as the date for which the data is current (the “data date”). In this case, we set this date as 27 July 2013.
The object data_date can then be referenced in later code, for example in plot captions and date filters.
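A minimal sketch of how such a parameter could be defined is shown here. The object name data_date matches the name used in later code; the derived object cutoff_90_days is hypothetical and only illustrates date arithmetic.

```r
# Date for which the data is current, stored as class Date
data_date <- as.Date("2013-07-27")

# Date objects support arithmetic, e.g. a cut-off 90 days before the data date
cutoff_90_days <- data_date - 90

# The date can be pasted into text such as a caption
paste0("Source: Linelist data from: ", data_date)
```

Storing the value as class Date, rather than as a character string, is what allows comparisons such as date_onset > (data_date - 90) later on.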
Optionally, identify all the date variables and store their names in a vector. This can be done by naming them individually, or by searching the column names for keywords.
Method 1
# Method 1. Define date variables explicitly in a vector
DateVars <- c("date_onset",
"date_hospitalisation",
"date_outcome",
"date_infection"
)
DateVars
## [1] "date_onset" "date_hospitalisation" "date_outcome"
## [4] "date_infection"
Method 2
# Method 2: Search for date columns
DateVars <- as.character(tidyselect::vars_select(names(linelist), matches("date|Date|dt")))
DateVars
## [1] "date_infection" "date_onset" "date_hospitalisation"
## [4] "date_outcome" "date_death"
# Note: other search tool options within vars_select include contains(), ends_with(), starts_with(), or one_of()
Verify that each variable was successfully converted to date class by printing statistics and a quick histogram for each one.
# To verify successful conversion of date variables
# Column positions of the date variables
varNums <- match(DateVars, names(linelist))
# Produce output for each date variable converted
for (varNum in varNums) {
  varName <- names(linelist)[varNum]   # get name of variable
  values <- linelist[[varNum]]         # extract the column as a vector (works for data frames and tibbles)
  varClass <- class(values)[1]         # get class (named varClass to avoid masking the class() function)
  missing <- sum(is.na(values))        # get number of missing values
  hist(values,                         # histogram
       breaks = 50,
       main = paste0("Histogram of: ", varName, ", Class: ", varClass, ", Missing: ", missing),
       xlab = varName)
}
incidence package
Below are tabs on using the “incidence” package
This section shows variations on the epicurve using the incidence package
These are simple epicurves made with the incidence package. Each epicurve is assigned to an object, which is then plotted. Remember that calling plot() on an incidence object uses the package's own plotting method, which behaves differently from the default base R plot().
The interval defines how the observations are grouped. The options are all those in the aweek package, including but not limited to:
* “Monday week”
* “2 Monday weeks”
* “Sunday week”
* “MMWRweek” (starts on Sunday)
* “Month”
* “Quarter”
* “Year”
First date and last date can also be specified.
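To make the idea of an interval concrete, the sketch below groups a few dates into Monday-based weeks using base R only. It is not how the incidence package is implemented; it only illustrates what weekly grouping means. The onset dates are hypothetical.

```r
# Hypothetical onset dates (1 July 2013 was a Monday)
onset <- as.Date(c("2013-07-01", "2013-07-02", "2013-07-07",
                   "2013-07-08", "2013-07-20"))

# Assign each date to the Monday that starts its week
week_start <- cut(onset, breaks = "week", start.on.monday = TRUE)

# Count cases per week
weekly_counts <- table(week_start)
weekly_counts
```

Changing start.on.monday to FALSE moves the week boundary to Sunday, which is the same kind of choice as "Monday week" versus "Sunday week" above.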
# incidence object is created, with data aggregated at one day intervals
epicurve_daily <- incidence::incidence(linelist$date_onset, interval = "day")
# If weekly, you can specify the start day
epicurve_weekly <- incidence::incidence(linelist$date_onset, interval = "Monday week")
epicurve_3weekly <- incidence::incidence(linelist$date_onset, interval = "3 weeks")
# Monthly
epicurve_monthly <- incidence::incidence(linelist$date_onset, interval = "month")
# Plot the incidence object
plot(epicurve_daily)
plot(epicurve_weekly)
plot(epicurve_3weekly)
plot(epicurve_monthly)
Behind the scenes, incidence is using ggplot(), so you can add aesthetic themes and other lines using the ggplot syntax.
# Set theme elements using ggplot syntax
epicurve_theme <- ggplot2::theme(
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
legend.title = element_blank(),
panel.grid.major.x = element_line(color = "grey60", linetype = 3),
panel.grid.major.y = element_line(color = "grey60", linetype = 3)
)
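# Number of cases missing date of onset (used in the caption)
missing_onset <- sum(is.na(linelist$date_onset))
# Sets labels using ggplot syntax (data_date must already be defined)
epicurve_labels <- labs(x = "Week",
y = "Cases (n)",
title = "H7N9 cases by week of onset",
caption = paste0("Source: Linelist data from: ", data_date, "; ", missing_onset, " are missing date of onset and not shown."))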
# plot the epicurve with aesthetics
nice_plot <- plot(epicurve_weekly, show_cases = TRUE, border = "black", n_breaks = nrow(epicurve_weekly)) +
scale_y_continuous(expand = c(0, 0)) + # set origin for axes
# add labels
epicurve_labels +
# add theme
epicurve_theme
nice_plot
# Modify nice_plot to show only 6 breaks in the x-axis
nice_plot + scale_x_incidence(epicurve_weekly, n_breaks = 6)
Cases can be differentiated by gender using the groups = argument of the incidence() command.
# Create epiweek object, with counts grouped by gender
epicurve_weekly_gender <- incidence(linelist$date_onset,
interval = "week",
groups = linelist$gender,
na_as_group = FALSE) # Prevents missing values from being assigned their own group
# Plot the epicurve
# Note: Remove the boxes around each case as it makes gender colours hard to see! (show_cases = FALSE)
nice_plot <- plot(epicurve_weekly_gender, show_cases = FALSE, border = "black", n_breaks = nrow(epicurve_weekly_gender)) +
# add labels (defined in previous section)
epicurve_labels +
# add theme elements
epicurve_theme
nice_plot
To show a subset, filter the data before passing it to incidence(). This version is filtered to show only cases from a specific hospital.
# filter the dataset and pass it to the incidence() function
Connaught_data <- filter(linelist, hospital == "Connaught Hospital")
epicurve_Connaught <- incidence(Connaught_data$date_onset,
interval = "week",
groups = Connaught_data$gender)
# Re-sets labels, changing title to reflect subset
epicurve_labels <- labs(x = "Week",
y = "Cases",
title = "Ebola cases by week of onset in Connaught Hospital",
caption = paste0("Source: Linelist data from: ", data_date, "; ", missing_onset, " are missing date of onset and not shown."))
# plot as before
plot(epicurve_Connaught, show_cases = TRUE, border = "grey") +
# add labels (defined in previous section)
epicurve_labels +
# add theme elements
epicurve_theme
ggplot2
Below are tabs on using the “ggplot2” package
# Daily case counts
###################
plot_daily <- ggplot(linelist, aes(x = date_onset)) +
# stacked bars, binned by day (1 day)
stat_bin(binwidth = 1, position="stack")
print(plot_daily)
# Weekly case counts
###################
plot_weekly <- ggplot(linelist, aes(x = date_onset)) +
# stacked bars, binned by week (7 days)
stat_bin(binwidth = 7, position="stack", fill = "brown")
print(plot_weekly)
# Preparation
#############
# Create epiweek variable. Factor argument automatically includes all weeks in span. Numeric shows just the week number.
linelist$epiweek <- aweek::date2week(linelist$date_onset, factor = TRUE, numeric = TRUE)
# Calculate maximum number of cases in an epiweek, to get the maximum y-axis height (also helps with uniformity in multiple plots)
ymax <- max(summary(factor(linelist$epiweek), maxsum = length(linelist$epiweek)))
# Weekly case counts
###################
plot_weekly <- ggplot(linelist, aes(x = date_onset)) +
# stacked bars, binned by week (7 days)
stat_bin(binwidth = 7, position = "stack", fill = "grey", color = "black") +
# X-axis weekly labels
scale_x_date(
# Sets date label breaks every week, starting from the first onset date
breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm=T), by = "1 week"),
# axis limits determined by max/min + buffer
limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)),
# displays as day number, then abbreviated month (e.g. 12-Oct)
date_labels = "%d-%b",
# sets origin at (0,0)
expand = c(0,0)) +
# Y-axis breaks every 5 cases
scale_y_continuous(breaks = seq(0, ymax, 5),
limits = c(0, ymax),
expand = c(0, 0)) +
# Theme specifications (axis, text, etc.)
theme(# title
plot.title = element_text(size=20, hjust= 0, face="bold"), # title size, font, bold
# axes
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
axis.line = element_line(colour="black"),
# background
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
# caption (italics, left-aligned)
plot.caption = element_text(hjust = 0, face = "italic")
) +
guides(fill = guide_legend(reverse = TRUE, # Orders Non-active zones at end of legend
override.aes = list(size = 0.2),
ncol = 2)) + # Number of legend columns
labs(x = "Week of illness onset",
y = "Number of cases",
subtitle = "subtitle here",
caption = paste0(nrow(linelist),
" confirmed and probable cases, reported as of ", data_date, ". ",
missing_onset, " cases missing date of onset and not shown.")) +
ggtitle("Epidemic curve")
# print
print(plot_weekly)
Colored by a category
# Setup
########
# Two known classes (select colors from colorbrewer2.org)
colors_overall = c("#d95f02", #
"#1b9e77",
"#7570b3") #
# Order sex variable by reverse # of cases, so plot stacks with smallest # of cases at top
linelist$gender <- factor(linelist$gender,
levels = levels(fct_rev(fct_infreq(linelist$gender))))
# Calculates maximum yaxis height for uniformity between the two graphs
ymax <- max(summary(factor(linelist$epiweek), maxsum = length(linelist$epiweek)))
# Number missing onset_date and cannot be graphed
missing_onset <- nrow(linelist[is.na(linelist$date_onset),])
# PLOT - BY ONSET DATE
######################
plot_defined_cats <- ggplot(linelist, aes(x = date_onset, fill = gender)) +
# stacked bars, width of 7 days
stat_bin(binwidth = 7, position = "stack") +
# Colors and labels of the gender categories
scale_fill_manual(values = rev(colors_overall),
labels = str_to_sentence(levels(factor(linelist$gender)))) +
# X-axis scale labels (not aggregation, just the labels)
scale_x_date(# Sets date label breaks as every week
breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "1 week"),
limits = c((min(linelist$date_onset, na.rm=T)), (max(linelist$date_onset, na.rm = T))), # axis limits determined by max/min + buffer
date_labels = "%d-%b", # displays as date # then abbreviated month (e.g. 12 Oct)
expand = c(0, 0)) + # sets origin at (0,0)
# Y-scale in breaks, up to the ymax previously defined
scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, ymax), expand=c(0, 0)) +
# Themes for axes, titles, background, etc.
theme(plot.title = element_text(size=20, hjust=0.5, face="bold"),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) +
# Legend specifications
theme(legend.title = element_blank(),
legend.justification = c(0, 1),
legend.position = c(0.09, 0.98),
legend.background = element_blank(),
legend.text = element_text(size = 12)) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2))) +
# Axis and caption labels
labs(x = "Week of illness onset",
y = "Number of Cases",
caption = paste(missing_onset,"cases were missing onset date and are not included in the onset graph")) +
# Title
ggtitle("Cases by week of illness onset")
# print
print(plot_defined_cats)
# PARAMETERS
#############
# Maximum y-value for epiweek (this will be larger than necessary because of missing onset dates)
ymax <- max(table(linelist$epiweek))
# Number missing onset_date and cannot be graphed
missing_onset <- nrow(filter(linelist, is.na(date_onset)))
# SETUP - ACTIVE/NON-ACTIVE ZONES
#################################
# List of "active" zones with a case in the date range
active_zones <- unique(linelist$province[which(linelist$date_onset > (data_date - 90))])
active_zones
# Table of active zones and their overall number of cases (for ordering their stacked appearance)
order_table <- linelist %>%
filter(province %in% active_zones) %>%
group_by(province) %>%
summarise(cases = n())
order_table
# Create TRUE/FALSE variable for "active" health zones
linelist$active_zone <- ifelse(linelist$province %in% active_zones, TRUE, FALSE)
# Create list of non-active HZ names for bottom of plot
other_zone_names <- unique(sort(linelist$province[linelist$active_zone == FALSE]))
# Make variable for graph categories, including a level for "non-active" zones
linelist$graph_zone <- factor(case_when(
# Value assignments
# Non-active zones
linelist$active_zone == FALSE ~ "Non-active zones",
# All others are assigned their names, capitalized
TRUE ~ stringr::str_to_title(linelist$province)),
# Order of variable levels
levels = c(
# "Non-active zones" is first level
"Non-active zones",
# Orders active zones by their frequency in linelist, reversed, so most-affected zones are on the BOTTOM of plot
str_to_title(rev(levels(fct_infreq(as.factor(linelist$province[linelist$active_zone == TRUE])))))))
table(linelist$graph_zone, useNA = "ifany")
# COLORS
########
# Number of unique values in graph_zone variable, minus 1 (for non-active, which is added later as grey (#cccccc))
colors_needed <- length(unique(na.omit(linelist$graph_zone))) - 1 # unique() has no na.rm argument, so missing values are dropped with na.omit()
# List of possible colors (see colorbrewer2.org, qualitative scheme)
colors_linelist = c(#"#cccccc", # first = non-active grey color
"#1b9e77", # turquoise green
"#ff7f00", # orange
"#ffff33", # yellow
"#6a3d9a", # purple
"#b15928", # brown
"#1f78b4", # blue
"#e31a1c", # red,
"#fb9a99", # pink
"#b2df8a", # light green
"#cab2d6", # light purple
"#a6cee3", # light blue
"#fdbf6f", # beige
"#33a02c" # green
)
# Reduce number of colors to only the number needed
colors_linelist <- c("#cccccc", rev(colors_linelist[1:colors_needed]))
# MAKE GRAPH
#############
plot_overall <- ggplot(linelist, aes(x = date_onset, fill = graph_zone)) +
# stacked bars, binned by week (7 days)
stat_bin(binwidth = 7, position = "stack") +
# Fill of bars
scale_fill_manual(values = colors_linelist,
labels = str_to_sentence(levels(factor(linelist$graph_zone)))) +
# X-axis weekly labels
scale_x_date( # Sets date label breaks every week, starting from the first onset date
breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "1 week"),
limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), # axis limits determined by max/min + buffer
date_labels = "%d-%b", # displays as date number, then abbreviated month (e.g. 12 Oct)
expand = c(0,0)) + # sets origin at (0,0)
# Y-axis breaks every 5 cases
scale_y_continuous(breaks = seq(0, ymax, 5),
limits = c(0, ymax),
expand = c(0, 0)) +
# Theme specifications (axis, text, etc.)
theme(plot.title = element_text(size = 20, hjust = 0, face = "bold"), # title size, font, bold
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
axis.line = element_line(colour="black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
plot.caption = element_text(hjust = 0, face = "italic")
) +
# Legend specifications
theme(legend.title = element_blank(), # No legend title
legend.position = c(0.20, 0.85), # placement of legend
legend.background = element_blank(), # legend background
legend.text = element_text(size=12)) + # legend text size
guides(fill = guide_legend(reverse = TRUE, # Orders Non-active zones at end of legend
override.aes = list(size = 0.2),
ncol = 2)) + # Number of legend columns
labs(x = "Week of illness onset",
y = "Number of cases",
subtitle = "Health zones with cases in the last 90 days specified by color",
caption = paste0(nrow(linelist),
" confirmed and probable cases, reported as of ", data_date, ". ",
missing_onset, " cases missing date of onset and not shown.",
"\nNon-active zones include: ", str_to_title(toString(unique(linelist$province[linelist$active_zone == FALSE]))))) +
ggtitle("Epidemic curve by active health zones")
# print
plot_overall
#SETUP
#############
# Filter to health zone of interest
zone_data <- linelist
# Number missing onset_date and cannot be graphed
missing_onset <- nrow(filter(linelist, is.na(date_onset)))
# Assign health area groups (individual for HAs of interest, groups others together)
linelist$graph_areas <- factor(case_when(
linelist$province == "Shanghai" ~ "Shanghai",
linelist$province == "Jiangsu" ~ "Jiangsu",
linelist$province == "Zhejiang" ~ "Zhejiang",
TRUE ~ "Other (10)"
),
# Levels part of the factor function assigns order of appearance
levels = c(
"Other (10)",
"Shanghai",
"Jiangsu",
"Zhejiang"
)
)
# checks
table(linelist$graph_areas, useNA = "ifany")
# Color assignments
colors_needed <- length(unique(na.omit(linelist$graph_areas))) - 1 # number of colors needed (missing values excluded)
# list of colors
colors_aire = c("#a6cee3",
"#1f78b4",
"#b2df8a",
"#33a02c",
"#fb9a99",
"#e31a1c",
"#fdbf6f",
"#ff7f00",
"#cab2d6",
"#6a3d9a",
"#ffff99",
"#b15928"
)
# Reduce number of colors to only the number needed
colors_aire <- c("#cccccc", rev(colors_aire[1:colors_needed]))
# Plot of province
#####################################
plot <- ggplot(linelist, aes(x = date_onset, fill = graph_areas)) +
stat_bin(binwidth = 7, position="stack") +
scale_fill_manual(values = colors_aire, labels = str_to_sentence(levels(factor(linelist$graph_areas)))) +
scale_x_date(date_breaks = "1 week", date_labels = "%d-%b", limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), expand=c(0,0)) + # I used the date onset variable here so x axes will be the same
scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, 35), expand = c(0, 0)) +
theme(plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
plot.caption = element_text(hjust = 0, face = "italic"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.text = element_text(size = 12),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
axis.title = element_text(size = 14, face = "bold"),
legend.title = element_blank(),
legend.justification = c(0,1),
legend.position = c(0.05, 1),
legend.background = element_blank(),
legend.text = element_text(size = 12)) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2), ncol = 4)) +
labs(x="Week of illness onset",
y="Number of cases",
subtitle = "",
caption = paste0(nrow(zone_data), " confirmed and probable cases, as of ", data_date, ". \n", missing_onset, " cases excluded due to missing date of onset.")) +
ggtitle("Cases of influenza, by province")
plot
# Define the waves
##################
# zone_data <- filter(linelist, zone_de_sante == "mabalako")
#
# zone_data$wave <- case_when(
# zone_data$date_onset >= as.Date("2018-03-01") &
# zone_data$date_onset < as.Date("2018-10-25") ~ "Wave 1",
#
# zone_data$date_onset >= as.Date("2018-10-25") &
# zone_data$date_onset < as.Date("2019-02-01") ~ "Wave 2",
#
# zone_data$date_onset >= as.Date("2019-02-01") &
# zone_data$date_onset < as.Date("2019-09-15") ~ "Wave 3",
#
# zone_data$date_onset >= as.Date("2019-09-15") ~ "Wave 4",
#
# TRUE ~ NA_character_
# )
#
# table(is.na(zone_data$date_onset))
# table(zone_data$wave, useNA = "always")
#
#
# # Color assignments
# colors_needed <- length(unique(zone_data$wave, na.rm=T)) # number of colors needed
#
# # list of colors
# colors_aire = c("#a6cee3",
# "#1f78b4",
# "#b2df8a",
# "#33a02c",
# "#fb9a99",
# "#e31a1c",
# "#fdbf6f",
# "#ff7f00",
# "#cab2d6",
# "#6a3d9a",
# "#ffff99",
# "#b15928"
# )
#
# # Reduce number of colors to only the number needed
# colors_aire <- c(rev(colors_aire[1:colors_needed]))
#
#
# # Plot of health zone colored by wave
# #####################################
# plot_Mabalako <- ggplot(zone_data, aes(x = date_onset, fill = wave)) +
#
# stat_bin(binwidth = 7, position = "stack") +
#
# scale_fill_manual(values = rev(colors_aire), labels = str_to_sentence(levels(factor(zone_data$wave)))) +
#
# scale_x_date(date_breaks = "21 days", date_labels = "%d-%b",
# limits = c((min(zone_data$date_onset, na.rm = T) - 8), (max(zone_data$date_report, na.rm = T) + 8)), expand = c(0,0)) +
#
# scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, 35), expand = c(0, 0)) +
#
# theme(text = element_text(family = "Segoe Condensed"),
# axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
# axis.text = element_text(size = 12),
# axis.title = element_text(size = 14, face = "bold"),
# axis.line = element_line(colour = "black"),
#
# plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
# plot.caption = element_text(hjust = 0, face = "italic"),
#
# panel.grid.major = element_blank(),
# panel.grid.minor = element_blank(),
# panel.background = element_blank(),
#
# legend.title = element_blank(),
# legend.justification = c(0,1),
# legend.position = c(0.75, 0.98),
# legend.background = element_blank(),
# legend.text = element_text(size=12)) +
#
# guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2), ncol = 1)) +
#
# labs(x="Week of illness onset",
# y="Number of cases",
# subtitle = "",
# caption = paste0(nrow(zone_data), " confirmed and probable cases, as of ", data_date, ". \n", missing_onset, " cases excluded due to missing date of onset and 16 excluded due to uncertain health zone of report.")) +
#
# ggtitle("Four waves of EVD in Mabalako health zone")
#
# plot_Mabalako
#
#
# # Produce table describing each wave
# ####################################
# table <- zone_data %>%
# select("aire_de_sante", "wave", "community_death", "date_onset", "cte_date", "epicasedef", "community_death", "contact_registered", "contact_surveilled") %>%
# group_by(wave) %>%
# summarise(first_onset = min(date_onset, na.rm = T),
# last_admission = max(cte_date, na.rm = T),
# n = n(),
# confirmed = sum(epicasedef == "confirmed"),
# community_deaths = paste0(sum(community_death == 1),
# " (", round(100*sum(community_death == 1)/confirmed),"%)"),
# reg_contacts = paste0(sum(contact_registered == "yes"),
# " (", round(100*sum(contact_registered == "yes")/confirmed),"%)"),
# surv_contacts = paste0(sum(contact_surveilled == "yes"),
# " (", round(100*sum(contact_surveilled == "yes")/confirmed),"%)"),
# top = paste(toupper(names(sort(table(aire_de_sante),decreasing=TRUE)[1:3])), collapse=", ",
# round(100*(sort(table(aire_de_sante),decreasing=TRUE)[1:3]/confirmed)), "%"),
# health_areas = paste(toupper(unique(aire_de_sante)), collapse=', ')
# )
#
# kable(table)
Often you do not have linelist data, but instead daily case counts from facilities, districts, etc. You can plot these in an epidemiological curve, but the code will be slightly different.
This section uses the count_data dataset that was imported earlier, in the data preparation section.
Note: The incidence package does not support aggregate data
As before, we must ensure date variables are correctly classified.
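If a date column was imported as text, it can be converted with as.Date(). The sketch below uses a small hypothetical data frame, count_data_example, with the same column names as the code in this section. The date format shown is an assumption and must match how the dates are written in your own file.

```r
# Hypothetical aggregated counts with dates stored as text
count_data_example <- data.frame(
  date_hospitalisation = c("2013-07-01", "2013-07-08", "2013-07-15"),
  hospital             = c("Hospital A", "Hospital A", "Hospital B"),
  n_cases              = c(3, 5, 2),
  stringsAsFactors     = FALSE)

class(count_data_example$date_hospitalisation)   # "character"

# Convert to class Date, stating the format explicitly
count_data_example$date_hospitalisation <- as.Date(
  count_data_example$date_hospitalisation, format = "%Y-%m-%d")

class(count_data_example$date_hospitalisation)   # "Date"

# Values that do not match the format become NA, so check for them
sum(is.na(count_data_example$date_hospitalisation))
```

For dates written day-first, such as 27/07/2013, the format would instead be "%d/%m/%Y".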
# Create epiweek variable
# aweek weeks are also stored as dates, facilitating better display manipulation
count_data$epiweek <- aweek::date2week(count_data$date_hospitalisation, # use the Date variable
week_start = "Monday", # epiweek begins on Monday
floor_day = TRUE, # only display year and week #
factor = TRUE) # expand to include all possible weeks
ggplot(data = count_data, aes(x = as.Date(epiweek), y = n_cases, group = hospital, fill = hospital))+
geom_bar(stat = "identity")+
# LABELS for x-axis
scale_x_date(date_breaks = "1 month", # displays by month
date_labels = '%b%d\n%Y')+ #labeled by month with year below
# Choose color palette (uses RColorBrewer package)
scale_fill_brewer(palette = "Pastel1")+
# Theme specifications (axis, text, etc.)
theme(
# title
plot.title = element_text(size=20, hjust= 0, face="bold"), # title size, font, bold
# axes
axis.text.x = element_text(angle=0, vjust=0.5, hjust=1),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
axis.line = element_line(colour="black"),
# background
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
# caption (italics, left-aligned)
plot.caption = element_text(hjust = 0, face = "italic"))+
# labels
labs(x = "Week of report",
y = "Number of cases",
subtitle = "Cases aggregated by week and shown by hospital",
caption = "Data source: XXXXX")+
ggtitle("Epidemic curve of disease X in fictional location")
Although the validity of dual-axis charts is fiercely debated within the data visualization community, many supervisors want to see an epicurve or similar chart with a percent overlaid on a second axis.
In ggplot2 this is difficult to do, except when the line shows the proportion of a category already shown in the bars below.
This uses the linelist dataset
TODO not complete yet
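The arithmetic behind a second axis can be sketched separately from the plot. In ggplot2 the secondary axis is a one-to-one transformation of the primary axis, so the percentages are first rescaled onto the count scale and the secondary axis applies the inverse. The values and object names below are hypothetical, and this is an illustration of the scaling only, not the finished chart.

```r
# Hypothetical weekly case counts and percent male per week
weekly_cases <- c(4, 10, 20, 8)
pct_male     <- c(50, 60, 75, 25)

# Factor that maps 0-100 percent onto 0-max(count)
scale_factor <- max(weekly_cases) / 100

# Values to draw as the line on the primary (count) axis
pct_on_count_scale <- pct_male * scale_factor

# Inverse transformation, used to label the secondary axis in percent
count_scale_to_pct <- function(y) y / scale_factor

pct_on_count_scale
count_scale_to_pct(pct_on_count_scale)   # recovers pct_male
```

In a ggplot2 call, the line would be drawn with y = pct_male * scale_factor and the secondary axis declared with sec_axis(~ . / scale_factor), so that the right-hand axis reads in percent.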
library(reshape2)
# group the data by week, summarize counts by group (gender)
linelist_week <- linelist %>%
mutate(onset_epiweek = aweek::date2week(date_onset, floor_day = TRUE, factor = TRUE)) %>%
group_by(onset_epiweek) %>%
summarize(num_male = sum(gender == "m"),
num_female = sum(gender == "f"),
pct_male = round(100*(num_male / n())),
med_age = median(as.numeric(age), na.rm=T)
)
# remove pct and melt
linelist_week_melted <- linelist_week %>%
select(-c("pct_male", "med_age")) %>%
melt(id.vars = c("onset_epiweek"))
# merge together (multiple of the same values in week will attach to melted)
linelist_week_melted <- merge(linelist_week_melted,
linelist_week,
by = "onset_epiweek")
second_axis <- ggplot(linelist_week_melted,
aes(x = as.Date(onset_epiweek),
y = value, group = variable,
fill = variable)) +
# bars
geom_bar(stat = "identity")+
# Colors and labels for the two gender counts (num_male, num_female)
scale_fill_manual(values = c("blue", "red"),
labels = str_to_sentence(levels(factor(linelist_week_melted$variable)))) +
geom_line(mapping = aes(y = pct_male, color = "% male"), size = 0.5) +
scale_color_manual(values = "black")+
# X-axis scale labels (not aggregation, just the labels)
scale_x_date(# Sets date label breaks as every week
breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "1 week"),
limits = c((min(linelist$date_onset, na.rm=T)), (max(linelist$date_onset, na.rm = T))), # axis limits determined by max/min + buffer
date_labels = "%d-%b", # displays as day of month, then abbreviated month (e.g. 12-Oct)
expand = c(0, 0)) + # sets origin at (0,0)
# Y-scale: a plot keeps only one y scale (a second scale_y_continuous() would replace the first),
# so the breaks, the limits (up to the ymax previously defined), and the secondary axis are set in one call
scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, ymax), expand=c(0, 0),
sec.axis = sec_axis(~(./sum(linelist_week_melted$value, na.rm = T)*100), name = "name here", breaks = seq(0, 100, 20))) +
# Themes for axes, titles, background, etc.
theme(plot.title = element_text(size=20, hjust=0.5, face="bold"),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) +
# Legend specifications
theme(legend.title = element_blank(),
legend.justification = c(0, 1),
legend.position = c(0.09, 0.98),
legend.background = element_blank(),
legend.text = element_text(size = 12)) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2))) +
# Axis and caption labels
labs(x = "Week of illness onset",
y = "Number of Cases",
caption = paste(missing_onset,"cases were missing onset date and are not included in the onset graph")) +
# Title
ggtitle("Cases by week of illness onset")
second_axis
# print
print(plot_defined_cats)
This tab should stay with the name “Resources”. Links to other online tutorials or resources.
The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook. {#title_tag .tabset .tabset-fade}
Keep the title of this section as “Overview”.
This tab should include:
Keep the title of this section as “Preparation”.
Data preparation steps such as:
ggplot2
This tab can be re-named. This tab should demonstrate execution of the task using a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.
Sub-tabs if necessary. Re-name as needed.
This tab should stay with the name “Resources”. Links to other online tutorials or resources.
Boxplots can be created with:
the boxplot() function from the graphics package (installed automatically with base R), or
the ggplot() function from the ggplot2 package
boxplot()
Some options with boxplot() are shown below:
# boxplot of one numeric variable
boxplot(linelist$age, # numeric variable
main="boxplot", # main title
ylab="Age") # y-axis label (the numeric variable is on the y-axis)
# by group (formula style)
boxplot(age ~ gender, data=linelist, notch=TRUE, main="boxplot", xlab="Gender")
You can have multiple levels of group (e.g. age by outcome AND gender)
Notched boxplots are possible. The notch is drawn around the median and shows an approximate 95% confidence interval for it; if the notches of two boxes do not overlap, this suggests that the medians differ.
# By subgroup (age by outcome AND gender)
boxplot(age ~ outcome * gender,
data=linelist,
col=c("gold","darkgreen"), # colors, in a vector
main="Boxplot by Outcome and Gender", # main title
xlab="Outcome and gender") # x-axis label
# Notched, and varying width
boxplot(age ~ outcome * gender,
data=linelist,
notch=TRUE, # notch at median
varwidth = TRUE, # width varying by sample size
col=(c("gold","darkgreen")),
main="Notched boxplot, width varying by sample size",
xlab="Outcome and gender")
# Horizontal
boxplot(age~outcome,
data=linelist,
horizontal=TRUE, # flip to horizontal
col=(c("gold","darkgreen")),
main="Horizontal boxplot",
xlab="Age") # horizontal plot, so the numeric variable is on the x-axis
ggplot()
Some options with ggplot() are shown below:
# Simple boxplot of one numeric variable
ggplot(data = linelist, aes(y = age))+ # only y variable given (no x variable)
geom_boxplot()+
ggtitle("Simple ggplot() boxplot")
# By group
ggplot(data = linelist, aes(y = age, # numeric variable
x = outcome, # group variable
fill = outcome))+ # fill variable (color of boxes)
geom_boxplot()+ # create the boxplot
ggtitle("ggplot() boxplot by outcome") # main title
# Removing missing values, and add color
ggplot(data = linelist %>% filter(!is.na(outcome)), # dataset piped through a filter to retain rows where outcome is not missing
aes(y = age, x = outcome, fill= outcome))+ # boxes filled according to outcome value
geom_boxplot()+
ggtitle("ggplot() boxplot by outcome (missing excluded)")
To examine by subgroups, use facet_wrap() (for more see section on ggplot tips).
# By subgroup
ggplot(data = linelist %>% filter(!is.na(gender)), # dataset piped through a filter to retain rows where gender is not missing
aes(y = age, x = outcome, fill=outcome))+
geom_boxplot()+
ggtitle("A ggplot() boxplot")+
facet_wrap(~gender)
“Violin plots” can be simple or very complex:
# Violin plots
ggplot(linelist, aes(x=age, y=outcome, fill = outcome)) +
geom_violin(trim=FALSE)
# Vertical violin plot
ggplot(linelist, aes(x=age, y=outcome, fill = outcome)) +
geom_violin(trim=FALSE)+
coord_flip()
# Add jittered points
ggplot(linelist, aes(x=age, y=outcome, fill = outcome)) +
geom_violin(trim=FALSE)+
coord_flip()+
geom_jitter(shape=16, # points
position=position_jitter(0.2)) # amount of jitter, to reduce point overlap
Calculating Visualizing
The primary tool to visualize and analyze transmission chains is the package epicontacts, developed by the folks at RECON.
# mers_korea_2015 is an example dataset from the outbreaks package
links <- epicontacts::make_epicontacts(linelist = outbreaks::mers_korea_2015$linelist,
contacts = outbreaks::mers_korea_2015$contacts,
directed = TRUE)
# plot without time
plot(links,
selector = FALSE,
height = 700,
width = 700)
And in a transmission tree, with date of onset on the x-axis:
Note: this currently requires installing a development version of epicontacts from github… @ttree
summary(links)
##
## /// Overview //
## // number of unique IDs in linelist: 162
## // number of unique IDs in contacts: 97
## // number of unique IDs in both: 97
## // number of contacts: 98
## // contacts with both cases in linelist: 100 %
##
## /// Degrees of the network //
## // in-degree summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 1.0000 0.6049 1.0000 3.0000
##
## // out-degree summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.6049 0.0000 38.0000
##
## // in and out degree summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 1.00 1.21 1.00 39.00
##
## /// Attributes //
## // attributes in linelist:
## age age_class sex place_infect reporting_ctry loc_hosp dt_onset dt_report week_report dt_start_exp dt_end_exp dt_diag outcome dt_death
##
## // attributes in contacts:
## exposure diff_dt_onset
Endemic corridor analysis Detecting spikes in syndromic/routine surveillance
Why How When etc.
E.g. EVD patient “pathways” to outcome (via clinic or not, etc.)
HIV care continuum datasets? PreP datasets?
Sankey plots - show transitions among cohort over time, interrelatedness of groups Liza Coyer TODO
Or papers in meta-analysis
E.g. border closures during COVID
Age pyramids can be useful to show patterns by age group. They can show gender, or the distribution of other characteristics.
These tabs demonstrate how to produce age pyramids using:
ggplot()
via ggplot and via R4Epis methods TODO
To make a traditional age/sex demographic pyramid, the data must first be cleaned in the following ways:
First, load the packages required for this analysis:
pacman::p_load(rio, # to import data
here, # to locate files
tidyverse, # to clean, handle, and plot the data (includes ggplot package)
apyramid # a package dedicated to creating age pyramids
)
Now import the data
Sometimes, age variables will import as class “character”. This occurs most often if some values contain non-numeric characters, for example entries of “2 months”, or where a comma is used as the decimal separator, as in some countries (e.g. “4,5” to mean four and one half years).
In other circumstances, a numeric age variable may be adjacent to another “units” variable (values of either years or months). This will require additional cleaning to calculate a completely numeric age. You can find an example of this in the cleaning page (LINK).
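Both situations can be sketched in base R. The values and the object names (`age_raw`, `age_num`, `age_unit`, `age_years`) below are hypothetical and only illustrate the idea; real data will usually need more checks.

```r
# Toy values (hypothetical): ages stored as text, one with a decimal comma
age_raw <- c("34", "4,5", "61", "2 months")

# Replace the decimal comma with a point, then convert to numeric.
# suppressWarnings() hides the coercion warning raised by "2 months"
age_num <- suppressWarnings(as.numeric(gsub(",", ".", age_raw, fixed = TRUE)))
age_num
# entries that still contain text (here "2 months") become NA and need manual review

# Separate case: a numeric age with an adjacent "units" column
age       <- c(34, 18, 6)
age_unit  <- c("years", "months", "months")

# Convert months to years so that the whole column is in years
age_years <- ifelse(age_unit == "months", age / 12, age)
age_years
```

Counting the NA values before and after the conversion, for example with `sum(is.na(age_num))`, is a quick way to see how many entries could not be converted.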
#check the class of the linelist variable age
class(linelist$age)
## [1] "character"
# ensure class of age variable is numeric, by re-defining itself
linelist$age <- as.numeric(linelist$age)
Now create age groups. There are a few ways to do this. The simplest way is using the base function cut(), which creates groups from a numeric variable.
First you give the numeric variable to be cut (age), then the breaks argument, which is a vector (c()) of numeric break points.
By default, each group is closed on the right/upper side (the upper break is included) and open on the left/lower side (the lower break is excluded). The default labels use the notation “(A, B]”, which means the group does not include A (the lower break), but includes B (the upper break). You can reverse this behavior by providing the right argument and setting it equal to FALSE (explanation below).
# Create new variable, by cutting the numeric age variable
linelist$age_group <- cut(linelist$age,
breaks = c(0, 5, 10, 15, 20, 30, 45, 100))
# tabulate the number of observations per group
table(linelist$age_group)
##
## (0,5] (5,10] (10,15] (15,20] (20,30] (30,45] (45,100]
## 4 3 1 0 5 20 101
You may want to verify that each age was assigned to the correct group. You can do this by cross-tabulating the age and age_group variables (names added for clarity). For example, you can check that the one 15-year-old patient was correctly assigned.
table(age = linelist$age, age_group = linelist$age_group, useNA = "always")
## age_group
## age (0,5] (5,10] (10,15] (15,20] (20,30] (30,45] (45,100] <NA>
## 2 1 0 0 0 0 0 0 0
## 4 3 0 0 0 0 0 0 0
## 6 0 1 0 0 0 0 0 0
## 7 0 1 0 0 0 0 0 0
## 9 0 1 0 0 0 0 0 0
## 15 0 0 1 0 0 0 0 0
## 21 0 0 0 0 1 0 0 0
## 25 0 0 0 0 1 0 0 0
## 26 0 0 0 0 2 0 0 0
## 27 0 0 0 0 1 0 0 0
## 31 0 0 0 0 0 2 0 0
## 32 0 0 0 0 0 2 0 0
## 34 0 0 0 0 0 1 0 0
## 35 0 0 0 0 0 2 0 0
## 36 0 0 0 0 0 2 0 0
## 37 0 0 0 0 0 2 0 0
## 38 0 0 0 0 0 5 0 0
## 41 0 0 0 0 0 1 0 0
## 43 0 0 0 0 0 2 0 0
## 45 0 0 0 0 0 1 0 0
## 47 0 0 0 0 0 0 1 0
## 48 0 0 0 0 0 0 3 0
## 49 0 0 0 0 0 0 1 0
## 50 0 0 0 0 0 0 2 0
## 51 0 0 0 0 0 0 2 0
## 52 0 0 0 0 0 0 1 0
## 53 0 0 0 0 0 0 2 0
## 54 0 0 0 0 0 0 6 0
## 55 0 0 0 0 0 0 2 0
## 56 0 0 0 0 0 0 6 0
## 57 0 0 0 0 0 0 1 0
## 58 0 0 0 0 0 0 3 0
## 59 0 0 0 0 0 0 1 0
## 60 0 0 0 0 0 0 4 0
## 61 0 0 0 0 0 0 2 0
## 62 0 0 0 0 0 0 4 0
## 64 0 0 0 0 0 0 4 0
## 65 0 0 0 0 0 0 4 0
## 66 0 0 0 0 0 0 3 0
## 67 0 0 0 0 0 0 3 0
## 68 0 0 0 0 0 0 4 0
## 69 0 0 0 0 0 0 5 0
## 70 0 0 0 0 0 0 1 0
## 72 0 0 0 0 0 0 3 0
## 73 0 0 0 0 0 0 1 0
## 74 0 0 0 0 0 0 5 0
## 75 0 0 0 0 0 0 2 0
## 76 0 0 0 0 0 0 3 0
## 77 0 0 0 0 0 0 2 0
## 78 0 0 0 0 0 0 1 0
## 79 0 0 0 0 0 0 5 0
## 80 0 0 0 0 0 0 3 0
## 81 0 0 0 0 0 0 2 0
## 83 0 0 0 0 0 0 2 0
## 84 0 0 0 0 0 0 1 0
## 85 0 0 0 0 0 0 1 0
## 86 0 0 0 0 0 0 2 0
## 87 0 0 0 0 0 0 1 0
## 89 0 0 0 0 0 0 1 0
## 91 0 0 0 0 0 0 1 0
## <NA> 0 0 0 0 0 0 0 2
NOTE: If you provide a highest break value that is too low, values may be excluded accidentally! You can write code that automatically adapts, by replacing the static number with the max() function, as below. Don’t forget to include the na.rm argument to max() so that missing values are excluded from the maximum calculation (see LINK to description of max function).
linelist$age_group <- cut(linelist$age,
breaks = c(0, 5, 10, 15, 20, 30, 45, max(linelist$age, na.rm=T)))
table(linelist$age_group)
##
## (0,5] (5,10] (10,15] (15,20] (20,30] (30,45] (45,91]
## 4 3 1 0 5 20 101
It may be important to also include the argument include.lowest and set it equal to TRUE, so that any values of 0 are still included in the lowest group (this could apply if infants have been coded as age 0). Customized group labels can be added manually with the labels argument. Both are shown below:
linelist$age_group <- cut(linelist$age,
breaks = c(0, 5, 10, 15, 20, 30, 45, 100),
include.lowest = TRUE,
labels = c("0-5", "6-10", "11-15", "16-20", "21-30", "31-45", "46-100"))
table(linelist$age_group)
##
## 0-5 6-10 11-15 16-20 21-30 31-45 46-100
## 4 3 1 0 5 20 101
If you include the argument right and set it equal to FALSE, then the lower break points will be included in each group and the upper breaks will not be included. Note how the 15-year-old patient moves from the 3rd group to the 4th group, and the 45-year-old patient moves from the 6th group to the 7th.
Note: The include.lowest argument will now apply to the highest break point (100), not the lowest (0).
linelist$age_group <- cut(linelist$age,
breaks = c(0, 5, 10, 15, 20, 30, 45, 100), # same breaks
right = FALSE, # change the inclusion
include.lowest = TRUE, # same, but now applies to highest break
labels = c("0-4", "5-9", "10-14", "15-19", "20-29", "30-44", "45-100")) # now the labels must change
table(linelist$age_group)
##
## 0-4 5-9 10-14 15-19 20-29 30-44 45-100
## 4 3 0 1 5 19 102
If you want a fast way to make breaks and labels, you can use this:
# Make break points from 0 to 90 by 5
age_seq = seq(from = 0, to = 90, by = 5)
# Make one label per break point ("0-4", "5-9", ...)
age_lab = paste0(age_seq, "-", age_seq + 4)
length(age_seq)
## [1] 19
length(age_lab)
## [1] 19
# cut() needs one fewer label than break points (19 breaks define 18 groups),
# so the last label is dropped, and right = FALSE makes the groups match the labels.
# Left commented out: ages of 90 or above fall outside these breaks and would become NA.
# linelist$age_group <- cut(linelist$age,
# breaks = age_seq,
# right = FALSE,
# labels = age_lab[-length(age_lab)])
The package apyramid allows you to quickly make an age pyramid. For more nuanced situations, see the tab on using ggplot() to make age pyramids. You can read more about the apyramid package by entering ?age_pyramid in your R console.
First we load the package (if not already done)
Now, using the cleaned dataset (see previous tabs), we can create an age pyramid with just one command.
linelist dataframe
Note: If the split_by variable is binary (e.g. male/female, or yes/no), then the result will show up as a pyramid; otherwise, it will be presented as a faceted barplot with empty bars in the background indicating the range of the un-faceted data set. Values of split_by will show up as labels at the top of each facet.
By default, the bars display counts (not %), a dashed mid-line for each group is shown, the colors are green/purple, and missing values are not shown. These can all be adjusted, as shown below:
apyramid::age_pyramid(data = linelist,
age_group = "age_group",
split_by = "gender",
proportional = TRUE, # show percents, not counts
na.rm = FALSE # show a bar for patients missing age
)
You can always add additional ggplot() commands to the plot using the standard ggplot() “+” syntax, such as aesthetic themes and label adjustments:
apyramid::age_pyramid(data = linelist,
age_group = "age_group",
split_by = "gender")+
theme_minimal()+ # this is a standard ggplot() theme that simplifies the aesthetic background
labs(y = "Counts", # note that for age pyramids the x and y labels are switched.
x = "Age Groups", # Read more about this in the ggplot() age pyramid tab
caption = "My data source and caption here",
title = "Title of my plot",
subtitle = "Subtitle but with \n2 lines...")+
theme(axis.text = element_text(size = 10, face = "bold"), # see ggplot() tips page for details
axis.title = element_text(size = 12, face = "bold"))
age_pyramid() with aggregated data
If your data are already in counts by age group, you can still use the apyramid package, as shown below.
Let’s say that your dataset already looks like this:
agg_age_data <- linelist %>%
group_by(age_group, gender) %>%
summarize(cases = dplyr::n())
DT::datatable(agg_age_data)
apyramid::age_pyramid(data = agg_age_data,
age_group = "age_group",
split_by = "gender",
count = "cases", # give the column name for the aggregated counts
na.rm = TRUE # remove missing values (set to FALSE to show a bar for patients missing age)
)
ggplot()
Using ggplot() to build your age pyramid allows more flexibility, but requires more effort and understanding of how ggplot() works.
The first thing to understand is that initially the age groups are on the x-axis. You are creating a geom_histogram() layer for each of the two genders, one in positive count values and the other in negative count values. Then you are using the coord_flip() command to switch the X and Y axes.
TODO - automate the axis limits…
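One possible way to approach the TODO above, sketched in base R with toy data: count the cases per age group and gender, take the largest count, and use it to build limits that are symmetric around zero. The object names (`age_bins`, `counts`, `max_count`, `y_limits`, `y_breaks`) and the values are assumptions for illustration only.

```r
# Toy data (hypothetical values)
age    <- c(2, 7, 7, 23, 24, 26, 31, 38, 38, 38, 52, 67)
gender <- c("f", "m", "m", "f", "f", "m", "m", "f", "f", "f", "m", "f")

# Same style of break points as the pyramid bins
breaks <- seq(from = 0, to = 90, by = 5)

# Count cases per age group and gender, with groups closed on the left
age_bins <- cut(age, breaks = breaks, right = FALSE)
counts   <- table(age_bins, gender)

# The tallest single bar determines limits that are symmetric around zero
max_count <- max(counts)
y_limits  <- c(-max_count, max_count)

# Tidy break points within those limits
y_breaks  <- pretty(y_limits)
y_limits
```

These values could then replace the hard-coded numbers in the plot, for example `scale_y_continuous(limits = y_limits, breaks = y_breaks, labels = abs(y_breaks))`.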
## make the actual plot
age_sex_pyramid <- linelist %>%
ggplot(aes(x = age, fill = gender)) +
geom_histogram(data = filter(linelist, gender == "f"),
breaks = age_seq,
colour = "white",
closed = "left") +
geom_histogram(data = filter(linelist, gender == "m"),
breaks = age_seq,
aes(y=..count..*(-1)),
color = "white",
closed = "left") +
scale_x_continuous(breaks = age_seq + 2.5, labels = age_lab, limits = c(0, 90)) +
scale_y_continuous(limits = c(-20, 20), breaks = seq(-20,20,1), labels = abs(seq(-20, 20, 1))) +
scale_fill_manual(
values = c("f" = "orange",
"m" = "darkgreen"),
labels = c("Female", "Male")
) +
labs(
x = "Age group",
y = "Number of cases",
fill = NULL,
caption = glue::glue("Data are from linelist X \n n = {nrow(linelist)} (age or sex missing for {sum(is.na(linelist$gender) | is.na(linelist$age))} cases) \nData as of: {format(Sys.Date(), '%d %b %Y')}")) +
coord_flip() +
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust=0, size=11)) +
ggtitle(paste0("Age and gender of cases"))
print(age_sex_pyramid)
With the flexibility of ggplot(), you can have a second layer of bars in the background that represent the true population pyramid. This can provide a nice visualization to compare the observed counts with the baseline.
TO DO
Nice to use when tracking metrics at many facilities/regions over time
For example:
DT::datatable(facility_count_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))
Group the data by week and location, and then make summary calculations:
# Get number & percent of days data reported for, by week
agg_weeks <- facility_count_data %>%
mutate(week = as.Date(aweek::week2date(aweek::date2week(data_date, floor_day = T)))) %>%
group_by(location_name, week) %>%
summarize(n_days = 7,
n_reports = n(),
malaria_tot = sum(malaria_tot, na.rm = T),
n_days_reported = length(unique(data_date)),
p_days_reported = round(100*(n_days_reported / n_days))) %>%
filter(week < as.Date("2019-06-10"))
### Days
agg_days <- facility_count_data %>%
filter(data_date < as.Date("2019-06-10"))
Then we make the plot:
ggplot(agg_weeks, aes(x=week, y=location_name, fill= p_days_reported))+
geom_tile(colour="white",size=0.2)+
guides(fill=guide_legend(title="Reporting\nperformance (%)"))+
labs(x="Week (date of data)",
y="Facility name",
title="Percent of days per week that facility reported data",
subtitle = "52 health facilities, April-May 2019",
caption = "7-day weeks beginning on Mondays.")+
scale_fill_gradient(low = "yellow", high = "darkgreen", na.value = "grey80")+
theme_light()+
theme(legend.position="right",legend.direction="vertical",
legend.title=element_text(size=12, face="bold"),
legend.margin=margin(grid::unit(0,"cm")),
legend.text=element_text(size=10,face="bold"),
legend.key.height=grid::unit(0.8,"cm"),
legend.key.width=grid::unit(0.2,"cm"),
axis.text.x=element_text(size=12),
axis.text.y=element_text(vjust=0.2),
axis.ticks=element_line(size=0.4),
axis.title=element_text(size=12, face="bold"),
plot.background=element_blank(),
panel.border=element_blank(),
plot.margin=margin(0.7,0.4,0.1,0.2,"cm"),
plot.title=element_text(hjust=0,size=14,face="bold"))
This analysis plots the frequency of different combinations of values/responses. In this example, we plot the frequency of symptom combinations.
This analysis is often called:
Multiple response analysis Sets analysis Combinations analysis
The first method shown uses the package ggupset, and the second uses the package UpSetR.
An example plot is below. Five symptoms are shown. Below each vertical bar is a line and dots indicating the combination of symptoms reflected by the bar above. To the right, horizontal bars reflect the frequency of each individual symptom.
This linelist includes five “yes/no” variables on reported symptoms. We will need to transform these variables a bit to use the ggupset package to make our plot.
View the data (scroll to the right to see the symptoms variables)
We convert the “yes” and “no” values to the actual symptom name. If “no”, we set the value as missing (NA).
# create column with the symptoms named, separated by semicolons
linelist_sym_1 <- linelist_sym %>%
# convert the "yes" and "no" values into the symptom name itself
mutate(fever = case_when(fever == "yes" ~ "fever", # if old value is "yes", new value is "fever"
TRUE ~ NA_character_), # if old value is anything other than "yes", the new value is NA
chills = case_when(chills == "yes" ~ "chills",
TRUE ~ NA_character_),
cough = case_when(cough == "yes" ~ "cough",
TRUE ~ NA_character_),
aches = case_when(aches == "yes" ~ "aches",
TRUE ~ NA_character_),
shortness_of_breath = case_when(shortness_of_breath == "yes" ~ "shortness_of_breath",
TRUE ~ NA_character_))
Now we make two final variables:
1. Pasting together all the symptoms of the patient (character variable)
2. Convert the above to class list, so it can be accepted by ggupset to make the plot
linelist_sym_1 <- linelist_sym_1 %>%
# combine the symptom variables into one, with a semicolon separating the values
# na.rm = TRUE drops missing values (paste() would instead insert the text "NA")
tidyr::unite(col = "all_symptoms",
c(fever, chills, cough, aches, shortness_of_breath),
sep = "; ", remove = FALSE, na.rm = TRUE) %>%
mutate(
# make a copy of all_symptoms variable, but of class "list" (which is required to use ggupset in the next step)
all_symptoms_list = as.list(strsplit(all_symptoms, "; "))
)
View the new data. Note the two new columns: the pasted combined values, and the list.
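One thing worth checking when combining columns that contain missing values: base paste() converts NA to the literal text "NA", which would then be treated as if it were a symptom. The sketch below shows this with toy vectors and one base R way to drop missing values first. The helper `combine_present()` and all object names are hypothetical, written only for this illustration.

```r
# Toy columns (hypothetical): the symptom name if present, NA if not
fever <- c("fever", NA,      "fever")
cough <- c("cough", "cough", NA)

# paste() turns NA into the text "NA"
pasted <- paste(fever, cough, sep = "; ")
pasted

# Hypothetical helper: keep only the non-missing values of one patient, then combine
combine_present <- function(...) {
  vals <- c(...)
  paste(vals[!is.na(vals)], collapse = "; ")
}

# Apply the helper row by row across the two columns
combined <- mapply(combine_present, fever, cough, USE.NAMES = FALSE)
combined

# Split into a list, with one character vector per patient
symptom_list <- strsplit(combined, "; ", fixed = TRUE)
symptom_list
```

A quick check such as `any(grepl("NA", pasted))` can reveal whether the combined column contains this artifact.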
ggupset
Load required package to make the plot (ggupset)
Create the plot:
ggplot(linelist_sym_1,
aes(x=all_symptoms_list)) +
geom_bar() +
scale_x_upset(reverse = FALSE,
n_intersections = 10,
sets = c("fever", "chills", "cough", "aches", "shortness_of_breath")
)+
labs(title = "Signs & symptoms",
subtitle = "10 most frequent combinations of signs and symptoms",
caption = "Caption here.",
x = "Symptom combination",
y = "Frequency in dataset")
More information on ggupset can be found online or offline in the package documentation in your RStudio Help tab.
UpSetR
The UpSetR package allows more customization, but it is more difficult to execute:
https://github.com/hms-dbmi/UpSetR (read this)
https://gehlenborglab.shinyapps.io/upsetr/ (Shiny App version, where you can upload your own data)
https://cran.r-project.org/web/packages/UpSetR/UpSetR.pdf (documentation, difficult to interpret)
Convert symptoms variables to 1/0.
# Make using upSetR
linelist_sym_2 <- linelist_sym %>%
# convert the "yes" and "no" values into 1 and 0
mutate(fever = case_when(fever == "yes" ~ 1, # if old value is "yes", new value is 1
TRUE ~ 0), # if old value is anything other than "yes", the new value is 0
chills = case_when(chills == "yes" ~ 1,
TRUE ~ 0),
cough = case_when(cough == "yes" ~ 1,
TRUE ~ 0),
aches = case_when(aches == "yes" ~ 1,
TRUE ~ 0),
shortness_of_breath = case_when(shortness_of_breath == "yes" ~ 1,
TRUE ~ 0))
Now make the plot, using only the symptom variables. You must designate which “sets” to compare (the names of the symptom variables).
Alternatively, use nintersects = together with order.by = "freq" to show only the top X combinations (nsets = limits the number of sets instead).
# Make the plot
UpSetR::upset(
select(linelist_sym_2, fever, chills, cough, aches, shortness_of_breath),
sets = c("fever", "chills", "cough", "aches", "shortness_of_breath"),
order.by = "freq",
sets.bar.color = c("blue", "red", "yellow", "darkgreen", "orange"), # optional colors
empty.intersections = "on",
# nsets = 3,
number.angles = 0,
point.size = 3.5,
line.size = 2,
mainbar.y.label = "Symptoms Combinations",
sets.x.label = "Patients with Symptom")
Embed Rmarkdown cheatsheet Table issues HTML candies Making tables, cheatsheet contained in HTML handbook?
Keep the title of this section as “Overview”.
This tab should include:
Keep the title of this section as “Preparation”.
Data preparation steps such as:
Can be used to separate major steps of data preparation. Re-name as needed
Can be used to separate major steps of data preparation. Re-name as needed.
This tab should stay with the name “Resources”. Links to other online tutorials or resources.
This tab can be renamed. This tab should demonstrate execution of the task using the recommended package/approach, for example a package customized for this task, where execution is simple and fast but perhaps less customizable (e.g. using the incidence package to create an epicurve).
Sub-tabs if necessary. Re-name as needed.
This tab can be renamed. This tab should demonstrate execution of the task using a more standard/core package (e.g. ggplot2 or base R) that allows for more flexibility in the output or more package stability, for example showing how to create an epicurve using ggplot2.
Sub-tabs if necessary. Re-name as needed.
Tidymodels
Liza Coyer TODO this? Longitudinal data
R(t) estimations
Doubling times
Projections
rprofiles
Embed ggplot cheatsheet
Highlighting one line among many, etc.: gghighlight
Cowplot Complicated method (% 100 * …)
ggrepel
Troubleshooting tips, common errors, etc.
PLOTLY
Saving files, deleting files, creating folders, interacting with files in a folder, etc.
Overwriting files in Excel
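As a minimal sketch of the file tasks listed above, the following uses only base R and works inside a temporary folder so that nothing on disk is overwritten. The folder name handbook_demo, the file name demo.csv, and the small data frame are hypothetical and chosen only for this illustration.

```r
# Work inside a temporary folder (illustrative path; nothing permanent is touched)
work_dir <- file.path(tempdir(), "handbook_demo")

# Create a folder; recursive = TRUE also creates any missing parent folders
if (!dir.exists(work_dir)) dir.create(work_dir, recursive = TRUE)

# Save a small data frame as a CSV file
demo_path <- file.path(work_dir, "demo.csv")
write.csv(data.frame(id = 1:3, outcome = c("a", "b", "a")),
          demo_path, row.names = FALSE)

# List the CSV files in the folder
csv_files <- list.files(work_dir, pattern = "\\.csv$")

# Check that the file exists, then delete it and check again
existed_before <- file.exists(demo_path)
file.remove(demo_path)
exists_after <- file.exists(demo_path)
```

Building paths with file.path() rather than pasting strings together keeps the code portable across operating systems.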
Troubleshooting common errors and warnings